When filtering with a multi-column restriction present, all other restrictions were ignored.
So a query like:
`SELECT * FROM tbl WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.
This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)
When multi-column restrictions were detected, the code checked whether they were satisfied and returned immediately.
This is fixed by returning early only when these restrictions are not satisfied. When they are satisfied, the other restrictions are checked as well, to ensure all of them hold.
This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi-column and regular-column restrictions, and this approach was correct.
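The corrected control flow can be sketched as a minimal stand-alone example (hypothetical names and types; the real logic lives in cql3/selection/selection.cc):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A restriction is modeled as a predicate over a row (hypothetical simplification).
using row = std::vector<int>;
using restriction = std::function<bool(const row&)>;

// Buggy version: returns as soon as the multi-column restriction is evaluated,
// so the remaining (regular-column) restrictions are never consulted.
bool passes_buggy(const row& r, const restriction& multi_column,
                  const std::vector<restriction>& others) {
    (void)others;  // deliberately ignored - that's the bug
    return multi_column(r);
}

// Fixed version: return early only when the multi-column restriction fails;
// otherwise fall through and check every remaining restriction too.
bool passes_fixed(const row& r, const restriction& multi_column,
                  const std::vector<restriction>& others) {
    if (!multi_column(r)) {
        return false;
    }
    for (const auto& rst : others) {
        if (!rst(r)) {
            return false;
        }
    }
    return true;
}
```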
Fixes: #6200
Fixes: #12014
Closes #12031
* github.com:scylladb/scylladb:
cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
boost/restrictions-test: uncomment part of the test that passes now
cql-pytest: enable test for filtering combined multi column and regular column restrictions
cql3: don't ignore other restrictions when a multi column restriction is present during filtering
(cherry picked from commit 2d2034ea28)
Closes #12086
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could add the missing partition-end when needed, but that is pointless:
if the partition has no more data, we can simply drop the header, since it
won't be needed on the next page.
The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.
Fixes: https://github.com/scylladb/scylladb/issues/9482
Closes #11914
* github.com:scylladb/scylladb:
test/cql-pytest: add regression test for "IDL frame truncated" error
mutation_compactor: detach_state(): make it no-op if partition was exhausted
treewide: fix headers
A wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash. This wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentially caused a null dereference.
Tests: 1. The fixed issue scenario from GitHub.
2. Unit tests in release mode.
Fixes #11774
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>
Closes #11777
(cherry picked from commit ab7429b77d)
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.
This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
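A minimal sketch of the new contract, with hypothetical stand-in types for the detached fragments (the real ones are mutation fragments):

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, heavily reduced stand-in for the state needed to resume
// compaction: the partition-start and an optional static row.
struct detached_compaction_state {
    std::string partition_start;
    std::optional<std::string> static_row;
};

class compactor {
    bool _partition_exhausted = false;
    detached_compaction_state _state;
public:
    void set_state(detached_compaction_state s, bool exhausted) {
        _state = std::move(s);
        _partition_exhausted = exhausted;
    }
    // After the fix: return a disengaged optional when the current partition
    // was fully consumed, so the caller cannot push a stale header back into
    // the reader.
    std::optional<detached_compaction_state> detach_state() {
        if (_partition_exhausted) {
            return std::nullopt;
        }
        return std::move(_state);
    }
};
```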
(cherry picked from commit 70b4158ce0)
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.
maybe_quote() made an effort to only quote the identifier name if necessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., to or where - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.
So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.
This patch also adds two tests that reproduce the bug and verify its
fix:
1. Add to the low-level maybe_quote() test (a C++ unit test) also tests
that maybe_quote() quotes reserved keywords like "to", but doesn't
quote unreserved keywords like "int".
2. Add a test reproducing issue #9450 - creating a materialized view
whose key column is a keyword. This new test passes on Cassandra,
failed on Scylla before this patch, and passes after this patch.
It is worth noting that maybe_quote() now has a "forward compatibility"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old maybe_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:
1. Try hard not to introduce new reserved keywords. Instead, introduce
unreserved keywords. We've been doing this even before recognizing
this maybe_quote() future-compatibility problem.
2. In the next patch we will introduce quote() - which unconditionally
quotes identifier names, even if lowercase. These quoted names will
be uglier for lowercase names - but will be safe from future
introduction of new keywords. So we can consider switching some or
all uses of maybe_quote() to quote().
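A heavily reduced sketch of the quoting rule: a small hard-coded keyword set stands in for the real check, which asks the CQL parser whether the name is a valid identifier (all names below are illustrative, not the actual Scylla code):

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Tiny stand-in for the set of reserved keywords (the real list is much longer).
static const std::set<std::string> reserved_keywords = {"to", "where", "select", "from"};

bool needs_quoting(const std::string& name) {
    if (name.empty() || reserved_keywords.count(name)) {
        return true;  // empty names and reserved keywords must be quoted
    }
    for (char c : name) {
        unsigned char uc = static_cast<unsigned char>(c);
        // Anything outside lowercase alphanumerics and '_' requires quoting.
        if (!(std::islower(uc) || std::isdigit(uc) || c == '_')) {
            return true;
        }
    }
    return false;
}

std::string maybe_quote(const std::string& name) {
    if (!needs_quoting(name)) {
        return name;
    }
    std::string out = "\"";
    for (char c : name) {
        out += c;
        if (c == '"') out += '"';  // double embedded quotes, as CQL requires
    }
    out += '"';
    return out;
}
```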
Fixes #9450
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
(cherry picked from commit 5d2f694a90)
The problem was an incompatibility with Cassandra, which accepts a bool
given as a string in the `fromJson()` function. The remaining difference
between Cassandra and Scylla is that Scylla accepts whitespace around the
word in the string while Cassandra does not. Both are case-insensitive.
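The accepting rule described above can be sketched as follows (hypothetical helper, not the actual Scylla code):

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Trim surrounding whitespace, then compare case-insensitively against
// "true"/"false". Returns nullopt when the string is not a boolean.
std::optional<bool> parse_json_bool_string(std::string s) {
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };
    while (!s.empty() && is_space(s.front())) s.erase(s.begin());
    while (!s.empty() && is_space(s.back())) s.pop_back();
    for (auto& c : s) c = std::tolower(static_cast<unsigned char>(c));
    if (s == "true")  return true;
    if (s == "false") return false;
    return std::nullopt;  // not a boolean string
}
```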
Fixes: #7915
(cherry picked from commit 1902dbc9ff)
The EC2 instance metadata service can be busy; let's retry connecting with
an interval, just like we do in scylla-machine-image.
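A minimal sketch of such a retry loop (hypothetical names; the sleep is injected as a callback so the sketch stays testable):

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Attempt the metadata fetch up to max_attempts times, pausing between
// attempts; on the final failure, propagate the exception to the caller.
std::string fetch_with_retries(const std::function<std::string()>& fetch,
                               int max_attempts,
                               const std::function<void()>& sleep_between) {
    for (int attempt = 1;; ++attempt) {
        try {
            return fetch();
        } catch (const std::exception&) {
            if (attempt == max_attempts) {
                throw;  // out of retries: propagate the last failure
            }
            sleep_between();
        }
    }
}
```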
Fixes #10250
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #11688
(cherry picked from commit 6b246dc119)
(cherry picked from commit e2809674d2)
As described in issue #11801, we saw cases in Alternator where, when a GSI had both partition and sort keys that were non-key attributes in the base table, updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.
In this series we fix this bug (it was a bug in our materialized-views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).
Fixes #11801
Closes #11808
* github.com:scylladb/scylladb:
test/alternator: add test for issue 11801
MV: fix handling of view update which reassign the same key value
materialized views: inline used-once and confusing function, replace_entry()
(cherry picked from commit e981bd4f21)
While being stopped, the compaction manager may hit ENOSPC. This is not a
reason to fail the stopping process with an abort; it is better to log a
warning about it and proceed as if nothing happened.
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it a future-returning method and set up the _stop_future in its
only caller. This makes the next patch much simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
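The idea of validating at creation time rather than flush time can be sketched like this (hypothetical helper names; the 6.71e-5 limit is the one quoted above, and treating 0 as "unset" is an assumption of this sketch):

```cpp
#include <stdexcept>

// The minimum supported false-positive rate quoted above.
constexpr double min_supported_fp_chance = 6.71e-5;

// Reject an unsupported value when the table is created, instead of
// throwing later during a memtable flush.
void validate_bloom_filter_fp_chance(double fp_chance) {
    // fp_chance == 0 is treated as "unset" here (assumption of this sketch).
    if (fp_chance != 0 && fp_chance < min_supported_fp_chance) {
        throw std::invalid_argument("bloom_filter_fp_chance is too small");
    }
}

bool is_valid_fp_chance(double fp_chance) {
    try {
        validate_bloom_filter_fp_chance(fp_chance);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```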
Fixes #11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11576
(cherry picked from commit 4c93a694b7)
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.
The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.
So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.
Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.
Fixes #11222
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11298
(cherry picked from commit 941c719a23)
The generator was first setting the marker then applied tombstones.
The marker was set like this:
row.marker() = random_row_marker();
Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.
However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.
This could generate rows whose markers were not compacted with shadowable tombstones.
This broke row_cache_test::test_concurrent_reads_and_eviction on
comparison between the expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.
Fix by making sure there are no key clashes.
Closes #11663
(cherry picked from commit 5268f0f837)
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421
Closes #11422
(cherry picked from commit be9d1c4df4)
Some tests want to generate a fixed number of random partitions; make
their life easier.
(cherry picked from commit 98f3d516a2)
Ref #11421 (prerequisite)
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.
(Would info be more appropriate here than warning?)
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10556
Fixes #10636
(cherry picked from commit 00ed4ac74c)
It's not uncommon for cleanup to be issued against an entire keyspace,
which may be composed of tons of tables. To increase chances of success
if low on space, cleanup will now start from smaller tables first, such
that bigger tables will have more space available, once they're reached,
to satisfy their space requirement.
parallel_for_each() is dropped and wasn't needed given that manager
performs per-shard serialization of cleanup jobs.
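The new ordering can be sketched as follows (hypothetical table_info type; the real code orders the per-table cleanup jobs, not a plain vector):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct table_info {
    std::string name;
    uint64_t bytes_on_disk;
};

// Process smaller tables first, so bigger tables find more free space
// available by the time cleanup reaches them.
std::vector<table_info> cleanup_order(std::vector<table_info> tables) {
    std::sort(tables.begin(), tables.end(),
              [](const table_info& a, const table_info& b) {
                  return a.bytes_on_disk < b.bytes_on_disk;
              });
    return tables;
}
```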
Refs #9504.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130133712.64517-1-raphaelsc@scylladb.com>
(cherry picked from commit 0d5ac845e1)
In functions such as upgrade_to_v2 (excerpt below), if the constructor
of transforming_reader throws, r needs to be destroyed even though it
hasn't been closed. However, if a reader didn't start any operations, it
is safe to destruct such a reader. This issue can potentially manifest
itself in many more readers and might be hard to track down. This commit
adds a bool indicating whether a close is anticipated, thus avoiding
errors in the destructor.
Code excerpt:
flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) {
    class transforming_reader : public flat_mutation_reader_v2::impl {
        // ...
    };
    return make_flat_mutation_reader_v2<transforming_reader>(std::move(r));
}
Fixes #9065.
(cherry picked from commit 9ada63a9cb)
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.
fixes: #11465
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2c74062962)
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes #11202
(cherry picked from commit cdb3e71045)
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.
Fixes #11476
(cherry picked from commit 1c2eef384d)
Detecting a secondary index by checking for a dot
in the table name is wrong, as tables generated by Alternator
may contain a dot in their name.
Instead, detect both materialized views and secondary indexes
using the schema()->is_view() method.
Fixes #10526
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aa127a2dbb)
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).
Fixes #10286
Closes #10294
(cherry picked from commit 85e95a8cc3)
When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla
would throw. This behavior is fixed so that LEFT nodes are simply ignored.
Fixes #9771
Closes #9778
(cherry picked from commit 351f142791)
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes #11239
Closes #11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0.
Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but it was nevertheless fairly straightforward - just copy the exact same checking code to its right place, and keep the exact same tests to see we indeed fixed the bug.
Refs #10535.
The original cover letter from https://github.com/scylladb/scylla/pull/10420:
In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically.
In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL.
However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicate expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
Fixes https://github.com/scylladb/scylla/issues/10361
Fixes https://github.com/scylladb/scylla/issues/10399
Fixes https://github.com/scylladb/scylla/pull/10401
Closes #11142
* github.com:scylladb/scylla:
test/cql-pytest: reproducer for CONTAINS NULL bug
expressions: don't dereference invalid map subscript in filter
expressions: fix invalid dereference in map subscript evaluation
test/cql-pytest: improve tests for map subscripts and nulls
(cherry picked from commit 23a34d7e42)
lookup_readers might fail after populating some readers,
and those must be closed before returning the exception.
Fixes #10351
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10425
(cherry picked from commit 055141fc2e)
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.
Fixes #10787
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cd68b04fbf)
There is a bug introduced in e74c3c8 (4.6.0) which makes the memtable
reader skip a range tombstone for a certain pattern of deletions
and under a certain sequence of events.
_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.
The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.
For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:
[a, c] @ t0
[a, b) @ t1, [b, d] @ t2
c > b
The proper sequence for such versions is (assuming d > c):
[a, b) @ t1,
[b, d] @ t2
Due to the bug, the reader will produce:
[a, b) @ t1,
[b, c] @ t0
The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).
The problem goes away once MVCC versions merge.
Fixes #10913
Fixes #10830
Closes #10914
(cherry picked from commit a6aef60b93)
[avi: backport prerequisite position_range_to_clustering_range() too]
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across the others. All but the Azure one really
do so. The Azure one gets dc/rack on all shards, which is excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.
fixes: #10494
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc87d0)
Rewrite operations are scrub, cleanup and upgrade.
Race can happen because 'selection of sstables' and 'mark sstables as
compacting' are decoupled. So any deferring point in between can lead
to a parallel compaction picking the same files. After commit 2cf0c4bbf,
files are marked as compacting before rewrite starts, but it didn't
take into account the commit c84217ad which moved retrieval of
candidates to a deferring thread, before rewrite_sstables() is even
called.
Scrub isn't affected by this because it uses a coarse grained approach
where whole operation is run with compaction disabled, which isn't good
because regular compaction cannot run until its completion.
From now on, selection of files and marking them as compacting will
be serialized by running them with compaction disabled.
Now cleanup will also retrieve sstables with compaction disabled,
meaning it will no longer leave uncleaned files behind, which is
important to avoid data resurrection if node regains ownership of
data in uncleaned files.
Fixes #8168.
Refs #8155.
[backport notes:
- minor conflict around run_with_compaction_disabled()
- bumped into our old friend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95111,
so I had to use std::ref() on local copy of lambda
- with the yielding part of candidate retrieval now happening in
rewrite_sstables(), task registration is moved to after run_with_
compaction_disabled() call, so the latter won't incorrectly try
to stop the task that called it, which triggers an assert in
debug mode.
]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com>
(cherry picked from commit 80a1ebf0f3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10963
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).
Fixes #10851
Closes #10855
(cherry picked from commit bc3a635c42)
If the number of streams exceeds the number of token ranges
it indicates that some spurious streams from decommissioned
nodes are present.
In such a situation - simply regenerate.
Fixes #9772
Closes #9780
(cherry picked from commit ea46439858)
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.
In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:
scheduling_group
messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
if (must_compress) {
opts.compressor_factory = &compressor_factory;
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
- opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ // We send cookies only for non-default statement tenant clients.
+ if (idx > 3) {
+ opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ }
This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was changed to the maintenance
group. As a result, normal reads/writes now compete with maintenance
operations, raising their latency significantly.
Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load sees its 99th percentile increase by just
3ms during repair, compared to 10ms+ before.
Fixes #9505.
Closes #10673
(cherry picked from commit c83393e819)
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
Fixes #10183
Closes #10181
* github.com:scylladb/scylla:
test: add a case for LIKE operator on a descending order column
types: fix is_string for reversed types
(cherry picked from commit 733672fc54)
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.
However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.
Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.
The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.
Fixes #10151
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
(cherry picked from commit 0764e511bb)
The Storage field of "coredumpctl info" changed in systemd v248: it now adds
"(present)" at the end of the line when the coredump file is available.
Fixes #10669
Closes #10714
(cherry picked from commit ad2344a864)
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression parameters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.
Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax
However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.
So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).
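The added check can be sketched as follows (hypothetical signature; a null pointer stands in for an absent parameter in the request):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// An AttributesToGet parameter that is present but empty is a validation
// error, matching DynamoDB; an absent parameter (nullptr) stays allowed.
void validate_attributes_to_get(const std::vector<std::string>* attrs) {
    if (attrs && attrs->empty()) {
        throw std::invalid_argument("AttributesToGet must not be empty");
    }
}
```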
Fixes #10332
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
(cherry picked from commit 9c1ebdceea)
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!
This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.
The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.
After this patch, the following test now works on Scylla without
experimental flags turned on:
test/alternator/run test_ttl.py::test_describe_ttl_without_ttl
Refs #10660
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8ecf1e306f)
When entry loading fails and there is another request blocked on the
same page, attempt to erase the failed entry will abort because that
would violate entry_ptr guarantees, which is supposed to keep the
entry alive.
The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch takes a more
general approach by relaxing the assert.
The assert manifested like this:
scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.
Fixes #10617
Closes #10653
(cherry picked from commit f87274f66a)
Get all flush permits to serialize with any
ongoing flushes and prevent further flushes
during table::clear, in particular while calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.
Fixes #10423
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup
Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().
_last_compacted_keys is used by regular compaction for its round-robin
file-picking policy. It stores the last compacted key for each level,
so it is irrelevant for any other compaction type.
Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.
To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.
Fixes #10378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10508
(cherry picked from commit 8e99d3912e)
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.
Fixes #10487
Closes #10503
(cherry picked from commit 603dd72f9e)
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but if we return earlier from
get_sstables_for_compaction it does not get updated and may go out of sync.
Refs #10418
(to be closed when the fix reaches branch-4.6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10419
(cherry picked from commit 01f41630a5)
It seems that batch prepared statements always return false from
depends_on; this in turn makes the removal criterion of the
prepared statements cache always false, so the queries are
never evicted.
Here we change the function to report the true state: it
returns true if one of the sub-queries depends on the keyspace
and/or column family.
Fixes #10129
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
Purpose:
CQL statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The latter took as a parameter only a table
name, which makes no sense: there could be multiple tables with the same
name, each in a different keyspace, and it doesn't make sense to
generalize the test, i.e. to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one, depends_on, which takes a
keyspace name and optionally also a table name; that way every logical
dependency test that makes sense is supported by a single API call.
(cherry picked from commit bf50dbd35b)
Ref #10129
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.
Should probably be backported to all CDC enabled versions.
Fixes #10473
Closes #10474
(cherry picked from commit 78350a7e1b)
There are two issues with current implementation of remove/remove_if:
1) If it happens concurrently with get_ptr(), the latter may still
populate the cache using value obtained from before remove() was
called. remove() is used to invalidate caches, e.g. the prepared
statements cache, and the expected semantic is that values
calculated from before remove() should not be present in the cache
after invalidation.
2) As long as there is any active pointer to the cached value
(obtained by get_ptr()), the old value from before remove() will be
still accessible and returned by get_ptr(). This can make remove()
have no effect indefinitely if there is persistent use of the cache.
One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying a UDT, this can cause statement execution failures: the CQL
coordinator will try to interpret bound values using the old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:
User Defined Type value contained too many fields (expected 5, got 6)
The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.
The predicate-based remove_if() variant also has to invalidate values
which are concurrently loading to be safe: the predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.
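The intended remove() semantics can be sketched with a toy loading cache (a deliberately simplified, synchronous model, not the actual loading-cache API): erase immediately and ensure that a load which started before the removal cannot repopulate the entry. Here a per-key generation counter, an illustrative device, stands in for the real mechanism:

```python
class LoadingCache:
    """Toy model of invalidation-safe remove(): a value computed before
    remove() was called must never end up in the cache afterwards."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._gen = {}  # per-key generation, bumped by remove()

    def remove(self, key):
        # Erase immediately and bump the generation so that any load
        # started before remove() cannot repopulate the cache.
        self._values.pop(key, None)
        self._gen[key] = self._gen.get(key, 0) + 1

    def get(self, key):
        if key in self._values:
            return self._values[key]
        gen = self._gen.get(key, 0)
        value = self._loader(key)          # remove() may run during this load
        if self._gen.get(key, 0) == gen:   # only populate if not invalidated
            self._values[key] = value
        return value                       # the caller still gets its value
```

The caller that raced with remove() still receives the (stale) value it loaded, but the cache itself stays clean, so the next get() reloads.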
Fixes #10117
Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
Said method has to evict all querier cache entries belonging to the to-be-dropped table. This was already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be dereferenced later when those entries are evicted due to TTL. This window is now closed: the entries are evicted after the method has waited for all ongoing operations on said table to stop.
Fixes: #10450
Closes #10451
* github.com:scylladb/scylla:
replica/database: drop_column_family(): drop querier cache entries after waiting for ops
replica/database: finish coroutinizing drop_column_family()
replica/database: make remove(const column_family&) private
(cherry picked from commit 7f1e368e92)
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no user impact.
Fixes #10364.
Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
(cherry picked from commit 0c365818c3)
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no known user impact.
Fixes #10363.
Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)
[avi: make max_chunk_capacity() public for backport]
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.
Fix by updating the error codes.
A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.
Fixes #5610.
Closes #10362
(cherry picked from commit 987e6533d2)
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.
Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.
Currently, the container is only used in the sstable partition index
cache.
Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.
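The indexing rule can be illustrated with a toy chunked container (a sketch, not the actual utils container; the chunk size here is made tiny for illustration): push_back and pop_back must address the chunk that contains _size, which after a large reserve() is generally not the last allocated chunk.

```python
CHUNK = 4  # illustrative chunk capacity; the real one is much larger

class ChunkedVector:
    """Sketch of the fixed indexing: operations address the chunk that
    contains _size, never blindly the last allocated chunk."""

    def __init__(self):
        self._chunks = []
        self._size = 0

    def reserve(self, n):
        # May allocate several chunks ahead of _size.
        while len(self._chunks) * CHUNK < n:
            self._chunks.append([None] * CHUNK)

    def push_back(self, v):
        if self._size == len(self._chunks) * CHUNK:
            self._chunks.append([None] * CHUNK)
        # Write into the chunk containing _size, not self._chunks[-1].
        self._chunks[self._size // CHUNK][self._size % CHUNK] = v
        self._size += 1

    def pop_back(self):
        self._size -= 1
        chunk, off = divmod(self._size, CHUNK)
        v = self._chunks[chunk][off]
        self._chunks[chunk][off] = None
        return v
```

With the buggy `self._chunks[-1]` addressing, a reserve() that allocated spare chunks would make push_back write past the populated region, which is the invariant breakage described above.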
Introduced in 78e5b9fd85 (4.6.0)
Fixes #10290
Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
(cherry picked from commit 41fe01ecff)
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just the default configuration of the
scylla-server .deb package: --log-to-stdout is 0, same as a normal installation.
We don't want to keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.
Fixes #10270
Closes #10280
(cherry picked from commit bdefea7c82)
Previous versions of the Docker image ran scylla as root, but cb19048
accidentally changed it to the scylla user.
To keep compatibility we need to revert this to root.
Fixes #10261
Closes #10325
(cherry picked from commit f95a531407)
We changed the supervisor service name in cb19048, but this breaks
compatibility with scylla-operator.
To fix the issue we need to revert the service name to the previous one.
Fixes #10269
Closes #10323
(cherry picked from commit 41edc045d9)
When a query contains an IN restriction on its partition key,
it's currently not eligible for indexing. It was, however,
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.
Fixes #10300
Closes #10302
(cherry picked from commit c0fd53a9d7)
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.
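The reversed ttl ordering can be sketched like this (a toy model of the rule only, not the actual comparator; the field names are illustrative). Because write_time = expiry - ttl, two cells with equal expiry but different ttl differ in write time, and the smaller ttl means the later write:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    timestamp: int
    value: bytes
    expiry: int   # absolute expiry time
    ttl: int      # seconds the cell lives from its wall-clock write time

def merge_wins(a: Cell, b: Cell) -> Cell:
    """Pick the cell that survives a merge. When two cells are identical
    in all attributes but ttl, the one with the *smaller* ttl wins:
    same expiry with smaller ttl implies a later wall-clock write."""
    key_a = (a.timestamp, a.value, a.expiry, -a.ttl)  # ttl in reverse order
    key_b = (b.timestamp, b.value, b.expiry, -b.ttl)
    return a if key_a >= key_b else b
```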
Fixes #10173
Test: mutation_test.test_cell_ordering, unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their TTLs.
The problem with that is that the cells' hashes are different, and this
will cause repair to keep trying to repair discrepancies caused by the
differing ttl.
This may be triggered by e.g. the spark migrator, which computes the ttl
from the expiry time by subtracting the current time from the expiry
time to produce a respective ttl.
If the cell is migrated multiple times at different times, it will generate
cells that have the same expiry (by design) but different ttl values.
Fixes #10156
Test: mutation_test.test_cell_ordering, unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
No need to first check that the cells' expiry is different
or that deletion_time is different before comparing them
with `<=>`.
If they are the same, the function returns std::strong_ordering::equal
anyhow, and that is the same as `<=>` comparing identical values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>
(cherry picked from commit be865a29b8)
Ref #10156
compare_atomic_cell_for_merge() was placed in database.cc, before
atomic_cell.cc existed. Move it to its correct place.
Closes #9889
(cherry picked from commit 6c53717a39)
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.
The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.
This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind. Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.
Fixes #9573
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
As observed in #10026, after schema changes it somehow happened
that a column definition that does not match any of the base table
columns was passed to expression verification code.
The function that looks up the index of a column happens to return
-1 when it doesn't find anything, so using this returned index
without checking if it's nonnegative results in accessing invalid
vector data, and a segfault or silent memory corruption.
Therefore, an explicit check is added to see if the column was actually
found. This serves two purposes:
- avoiding segfaults/memory corruption
- making it easier to investigate the root cause of #10026
Closes #10039
(cherry picked from commit 7b364fec9849e9a342af1c240e3a7185bf5401ef)
When a new CDC generation is created (during bootstrap or otherwise), it
is assigned a timestamp. The timestamp must be propagated as soon as
possible, so all live nodes can learn about the generation before their
clocks reach the generation's timestamp. The propagation mechanism for
generation timestamps is gossip.
When bootstrap RBNO was enabled this was not the case: the generation
timestamp was inserted into gossiper state too late, after the repair
phase finished. Fix this.
Also remove an obsolete comment.
Fixes https://github.com/scylladb/scylla/issues/10149.
Closes #10154
* github.com:scylladb/scylla:
service: storage_service: announce new CDC generation immediately with RBNO
service: storage_service: fix indentation
(cherry picked from commit f1b2ff1722)
Tables waiting for a chance to run reshape wouldn't trigger stop
exception, as the exception was only being triggered for ongoing
compactions. Given that stop reshape API must abort all ongoing
tasks and all pending ones, let's change run_custom_job() to
trigger the exception if it found that the pending task was
asked to stop.
Tests:
dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces
unit: dev
Fixes #9836.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>
(cherry picked from commit 07fba4ab5d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311183053.46625-1-raphaelsc@scylladb.com>
Since regular compaction may run in parallel no lock
is required per-table.
We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.
This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables) seen in #10060, and
possibly #10166.
Fixes #10175
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10177
(cherry picked from commit 11ea2ffc3c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220314151416.2496374-1-bhalevy@scylladb.com>
Base tables that use compact storage may have a special artificial
column that has an empty type.
c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.
This patch makes sure that generation of preimage ignores such empty
column as well.
Fixes #9876
Closes #9910
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 09d4438a0d)
Add the missing partition-key validation in INSERT JSON statements.
Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).
Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.
The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.
Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
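The validation in question can be sketched in Python (an illustration of the rule, not Scylla's `modification_statement` code; the 65535-byte limit is an assumption based on keys being length-prefixed with 16 bits):

```python
MAX_KEY_SIZE = 65535  # assumed limit: serialized keys use a 16-bit length

def validate_partition_key(key: bytes) -> None:
    """Sketch of the check every mutation path must perform, including
    the INSERT ... JSON path that previously skipped it."""
    if len(key) == 0:
        # An empty-string partition key is forbidden (null clustering
        # keys, by contrast, are allowed).
        raise ValueError("Key may not be empty")
    if len(key) > MAX_KEY_SIZE:
        raise ValueError(
            f"Key length of {len(key)} is longer than maximum of {MAX_KEY_SIZE}")
```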
This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.
Reported by @FortTell
Fixes #9853.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
(cherry picked from commit 8fd5041092)
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that these are large
objects which were delegated to the standard allocator, but the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.
The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution; currently it is only invoked in the
standard allocator context.
The series also adds two safety checks to LSA to catch such problems earlier.
Fixes #10056
\cc @slivne @bhalevy
Closes #10130
* github.com:scylladb/scylla:
lsa: Abort when trying to free a standard allocator object not allocated through the region
lsa: Abort when _non_lsa_memory_in_use goes negative
tests: utils: cached_file: Validate occupancy after eviction
test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
utils: cached_file: Fix alloc-dealloc mismatch during eviction
(cherry picked from commit ff2cd72766)
This reverts commit 4c05e5f966.
Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.
Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.
Fixes #10060.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213165147.56204-1-raphaelsc@scylladb.com>
Ref: a9427f150a
The stall report uses the millisecond unit, but actually reports
nanoseconds.
Switch to microseconds (milliseconds are a bit too coarse) and
use the safer "duration / 1us" style rather than "duration::count()"
that leads to unit confusion.
Fixes #9733.
Closes #9734
(cherry picked from commit f907205b92)
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.
So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.
Fixes #10043.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.
So this patch fixes the scyllasetup.py script to only pass this
parameter once.
Fixes #10016.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
Fixes #10020
Previous fix 445e1d3 tried to close one double invocation but added
another, since it failed to ensure that all potential nullings of the
optional shared_future happened before a new allocator could reset it.
This simplifies the code by making clearing the shared_future a
pre-requisite for resolving its contents (as read by waiters).
This also removes any need for try-catch etc.
Closes #10024
(cherry picked from commit 1e66043412)
Refs #9896
Found by @eliransin. The call to new_segment was wrapped in with_timeout.
This means that if the primary caller timed out, we would leave new_segment
calls running, but potentially issue new ones for the next caller.
This could lead to the reserve segment queue being read simultaneously, and
that is not what we want.
Change to always use the shared_future wait, for all callers, and clear it
only on a result (exception or segment).
Closes #10001
(cherry picked from commit 445e1d3e41)
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.
Fix by running emplace() under allocating section, which disables
memory reclamation.
The bug manifests with assert failures, e.g:
./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.
Fixes #9915
Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
(cherry picked from commit b734615f51)
Fixes #9955
In #9348 we handled the problem of failing to delete segment files on disk, and
the need to recompute disk footprint to keep data flow consistent across intermittent
failures. However, because _reserve_segments and _recycled_segments are queues, we
have to empty them to inspect the contents. One would think it is ok for these
queues to be empty for a while, whilst we do some recalculating, including
disk listing -> continuation switching. But then one (i.e. I) misses the fact
that these queues use the pop_eventually mechanism, which does _not_ handle
a scenario where we push something into an empty queue, thus triggering the
future that resumes a waiting task, but then pop the element immediately, before
the waiting task is run. In fact, _iff_ one does this, not only will things break,
they will in fact start creating undefined behaviour, because the underlying
std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push
operations -> we will pop an empty queue, immediately making it non-empty, but
using undefined memory (with luck, null/zeroes).
Strictly speaking, seastar::queue::pop_eventually should be fixed to handle
this scenario, but nonetheless we can fix the usage here as well, by simply
copying the objects and doing the calculation "in the background" while we
potentially start popping the queue again.
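The push-then-immediate-pop hazard can be made concrete with a small asyncio model (an assumed simplification, not seastar's queue): the waiter is woken by a push, but the pusher takes the element back before the waiter runs. A waiter that re-checks the queue after waking, as below, survives; one that pops unconditionally would pop an empty queue:

```python
import asyncio

class NaiveQueue:
    """Toy queue illustrating why a woken waiter must re-check emptiness."""

    def __init__(self):
        self._items = []
        self._not_empty = asyncio.Event()

    async def pop_eventually(self):
        while not self._items:        # re-check after every wake-up
            self._not_empty.clear()
            await self._not_empty.wait()
        return self._items.pop(0)

    def push(self, item):
        self._items.append(item)
        self._not_empty.set()         # wakes the waiter...

    def pop_now(self):
        return self._items.pop(0)     # ...but the item may be gone already

async def main():
    q = NaiveQueue()
    waiter = asyncio.ensure_future(q.pop_eventually())
    await asyncio.sleep(0)            # let the waiter block on the event
    q.push("seg")
    q.pop_now()                       # push-then-immediate-pop race
    q.push("seg2")                    # waiter re-checks and gets this one
    return await waiter

print(asyncio.run(main()))  # -> seg2
```

Without the `while` re-check, the waiter would resume after the first push and pop an empty list, the Python analogue of the unchecked circular_buffer pop described above.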
Closes #9966
(cherry picked from commit 43f51e9639)
We found that the monitor mode of mdadm does not work on RAID0, and it is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.
Fixes #9540
----
This reverts 0d8f932 and introduce correct fix.
Closes #9970
* github.com:scylladb/scylla:
scylla_raid_setup: use mdmonitor only when RAID level > 0
Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
(cherry picked from commit df22396a34)
If the permit was admitted, _base_resources was already accounted in
_resource and therefore has to be deducted from it, otherwise the permit
will think it leaked some resources on destruction.
Test:
dtest(repair_additional_test.py.test_repair_one_missing_row_diff_shard_count)
Refs: #9751
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220119132550.532073-1-bdenes@scylladb.com>
(cherry picked from commit a65b38a9f7)
Appending an empty range adjacent to an existing range tombstone would
not deoverlap (by dropping the empty range tombstone), resulting in a
different (non-canonical) result depending on the order of appending.
Suppose that range tombstone [a, b) covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b) split around position x.
Appending [a, x) then [x, b) then [x, x) would give [a, b).
Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b).
The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical.
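A toy model of the canonicalization (ranges here are bare half-open intervals with no tombstone timestamps; in the real list, merging adjacent ranges additionally requires equal tombstones):

```python
def append_tombstone(lst, rng):
    """Append half-open range rng=(start, end), keeping the list
    canonical: empty ranges are dropped and adjacent ranges merged."""
    start, end = rng
    if start == end:
        return lst                              # drop empty range tombstones
    if lst and lst[-1][1] == start:
        return lst[:-1] + [(lst[-1][0], end)]   # merge adjacent ranges
    return lst + [rng]

# Both append orders from the example now yield the same canonical result:
one = append_tombstone(
    append_tombstone(append_tombstone([], ('a', 'x')), ('x', 'b')), ('x', 'x'))
two = append_tombstone(
    append_tombstone(append_tombstone([], ('a', 'x')), ('x', 'x')), ('x', 'b'))
```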
Fixes #9661
Closes #9764
* github.com:scylladb/scylla:
range_tombstone_list: Deoverlap adjacent empty ranges
range_tombstone_list: Convert to work in terms of position_in_partition
(cherry picked from commit b2a62d2b59)
Repair obtains a permit for each repair-meta instance it creates. This
permit is supposed to track all resources consumed by that repair as
well as ensure concurrency limit is respected. However when the
non-local reader path is used (shard config of master != shard config of
follower), a second permit will be obtained -- for the shard reader of
the multishard reader. This creates a situation where the repair-meta's
permit can block the shard permit, creating a deadlock situation.
This patch solves this by dropping the count resource on the
repair-meta's permit when a non-local reader path is executed -- that is
a multishard reader is created.
Fixes: #9751
* 'repair-double-permit-block/v4' of https://github.com/denesb/scylla:
repair: make sure there is one permit per repair with count res
reader_permit: add release_base_resource()
(cherry picked from commit 52b7778ae6)
Fixes #9653
When making an outgoing connection in an internode_encryption=dc/rack situation,
we should not use the endpoint/local broadcast address solely to determine if we
can downgrade a connection.
If gossip/message_service determines that we will connect to a different
address than the "official" endpoint address, we should use this to determine
the association of the target node; similarly, if we bind the outgoing connection
to an interface != bc, we need to use this to decide the local one.
Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc
until such time that gossip can give accurate info on dc/rack for "internal"
ip addresses of nodes.
(cherry picked from commit 4778770814)
The error-handling code removes the cache entry but this leads to an
assertion because the entry is still referenced by the entry pointer
instance which is returned on the normal path. To avoid this clear the
pointer on the error path and make sure there are no additional
references kept to it.
Fixes #9887
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220105140859.586234-2-bdenes@scylladb.com>
(cherry picked from commit 92727ac36c)
Fixes #9798
If an exception in allocate_segment_ex is a (sub)type of std::system_error,
commit_error_handler might _not_ throw (doh), in which case the error
handling code would forget the current exception and return an unusable
segment.
It is now only used as an exception pointer replacer.
Closes #9870
(cherry picked from commit 3c02cab2f7)
alloc_buf() calls new_buf_active() when there is no active segment to
allocate a new active segment. new_buf_active() allocates memory
(e.g. a new segment) so may cause memory reclamation, which may cause
segment compaction, which may call alloc_buf() and re-enter
new_buf_active(). The first call to new_buf_active() would then
override _buf_active and cause the segment allocated during segment
compaction to be leaked.
This then causes abort when objects from the leaked segment are freed
because the segment is expected to be present in _closed_segments, but
isn't. boost::intrusive::list::erase() will fail on assertion that the
object being erased is linked.
Introduced in b5ca0eb2a2.
Fixes #9821
Fixes #9192
Fixes #9825
Fixes #9544
Fixes #9508
Refs #9573
Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com>
(cherry picked from commit 7038dc7003)
When the UpdateTable operation is called for a non-existent table, the
appropriate error is ResourceNotFoundException, but before this patch
we ran into an exception, which resulted in an ugly "internal server
error".
In this patch we use the existing get_table() function which most other
operations use, and which does all the appropriate verifications and
generates the appropriate Alternator api_error instead of letting
internal Scylla exceptions escape to the user.
This patch also includes a test for UpdateTable on a non-existent table,
which used to fail before this patch and pass afterwards. We also add a
test for DeleteTable in the same scenario, and see it didn't have this
bug. As usual, both tests pass on DynamoDB, which confirms we generate
the right error codes.
Fixes #9747.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>
(cherry picked from commit 31eeb44d28)
Commit dcc73c5d4e introduced a semaphore
for excluding concurrent recalculations - _reserve_recalculation_guard.
Unfortunately, the two places in the code which tried to take this
guard just called get_units() - which returns a future<units>, not
units - and never waited for this future to become available.
So this patch adds the missing "co_await" needed to wait for the
units to become available.
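The bug class is easy to model in Python's asyncio (illustrative names, an assumed simplification, not the commitlog's actual code): the semaphore only excludes concurrent recalculations if the caller actually awaits the unit acquisition, which is what the missing co_await provided.

```python
import asyncio

async def recalculate(guard, state):
    async with guard:             # the fix: wait for the units
        state["active"] += 1
        state["max"] = max(state["max"], state["active"])
        await asyncio.sleep(0)    # yield in the middle of the recalculation
        state["active"] -= 1

async def main():
    guard = asyncio.Semaphore(1)  # stand-in for the recalculation guard
    state = {"active": 0, "max": 0}
    # Without awaiting the guard, both tasks would overlap here.
    await asyncio.gather(recalculate(guard, state), recalculate(guard, state))
    return state["max"]

print(asyncio.run(main()))  # -> 1: the recalculations never overlap
```

Calling acquire() without awaiting it, the analogue of calling get_units() and discarding the future, would provide no exclusion at all.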
Fixes #9770.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211214122612.1462436-1-nyh@scylladb.com>
(cherry picked from commit b8786b96f4)
On CentOS8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).
Fixes #9540
Closes #9782
(cherry picked from commit 0d8f932f0b)
Currently, when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it, causing a crash due
to an internal error check; see #9766.
Throwing a compaction_stopped_exception rather than co_return:ing
an exception will be handled like any other exception, including closing
the reader.
Fixes #9766
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
(cherry picked from commit c89876c975)
The B-tree's insert_before() is a throwing operation, and its caller
must account for that. When the rows_entry's collection was
switched to B-tree, all the risky places were fixed by ee9e1045,
but a few places slipped under the radar.
In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.
In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
is quite nasty, because those entries can be evicted; the eviction code
tries to get the rows_entry iterator from "this", but the hook happens
to be unattached (because the insertion threw) and fails the assert.
Fixes: #9728
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ee103636ac)
Both places get the C-pointer to the freshly allocated rows_entry,
insert it where needed, and return the dereferenced pointer.
The C-pointer is going to become smart-pointer that would go out
of scope before return. This change prepares for that by constructing
the ensure_result from the iterator, that's returned from insertion
of the entry.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 9fd8db318d)
Ref #9728
To avoid failing scylla-housekeeping in a strict umask environment,
we need to chmod a+r the repository file and housekeeping.uuid.
Fixes #9683
Closes #9739
(cherry picked from commit ea20f89c56)
The test suite names seen by Jenkins are suboptimal: there is
no distinction between modes, and the ".cc" suffix of file names
is interpreted as a class name, which is converted to a tree node
that must be clicked to expand. Massage the names to remove
unnecessary information and add the mode.
Closes #9696
(cherry picked from commit ef3edcf848)
Fixes #9738.
The recent parallelization of boost unit tests caused an increase
in xml result files. This is challenging to Jenkins, since it
appears to use rpc-over-ssh to read the result files, and as a result
it takes more than an hour to read all result files when the Jenkins
main node is not on the same continent as the agent.
To fix this, merge the result files in test.py and leave one result
file per mode. Later we can leave one result file overall (integrating
the mode into the testsuite name), but that can wait.
Tested on a local Jenkins instance (just reading the result files,
not the entire build).
Closes #9668
(cherry picked from commit b23af15432)
Fixes #9738
Backport of series to 4.6
Upstream merge commit: e2c27ee743.
Refs #9348
Closes #9702
* github.com:scylladb/scylla:
commitlog: Recalculate footprint on delete_segment exceptions
commitlog_test: Add test for exception in alloc w. deleted underlying file
commitlog: Ensure failed-to-create-segment is re-deleted
commitlog::allocate_segment_ex: Don't re-throw out of function
Fixes #9348
If we get exceptions in delete_segments, we can, and probably will, lose
track of footprint counters. We need to recompute the used disk footprint,
otherwise we will flush too often, and even block indefinitely on new_seg
if using hard limits.
Tests that we can handle exception-in-alloc cleanup if the file actually
does not exist. This however uncovers another weakness (addressed in the
next patch): we can lose track of the disk footprint here, and with hard
limits end up waiting for disk space that never comes. Thus the test does
not use a hard limit.
Fixes #9343
If we fail in allocate_segment_ex, we should push the file opened/created
to the delete set to ensure we reclaim the disk space. We should also
ensure that if we did not recycle a file in delete_segments, we still
wake up any recycle waiters if we deleted a file instead.
Included a small unit test.
We've been observing hard to explain crashes recently around
lsa_buffer destruction, where the containing segment is absent in
_segment_descs which causes log_heap::adjust_up to abort. Add more
checks to catch certain impossible scenarios which can lead to this
sooner.
Refs #9192.
Message-Id: <20211116122346.814437-1-tgrabiec@scylladb.com>
(cherry picked from commit bf6898a5a0)
We cannot recover from a failure in this method. The implementation
makes sure it never happens. Invariants will be broken if this
throws. Detect violations early by marking as noexcept.
We could make it exception safe and try to leave the data structures
in a consistent state but the reclaimer cannot make progress if this throws, so
it's pointless.
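A minimal illustration of the idea (hypothetical type name): with noexcept, a throw becomes an immediate std::terminate() instead of unwinding past half-updated data structures, and the promise is even checkable at compile time.

```cpp
#include <cassert>
#include <utility>

// Hypothetical reclaimer hook: an exception here would leave broken
// invariants, so it is marked noexcept. If it ever throws,
// std::terminate() fires immediately instead of unwinding past
// half-updated data structures.
struct segment_descriptor {
    void migrate() noexcept {
        // ... move data and update descriptors; must not throw ...
    }
};

// The promise is visible to the type system and can be asserted:
static_assert(noexcept(std::declval<segment_descriptor&>().migrate()),
              "migrate must never throw");
```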
Refs #9192
Message-Id: <20211116122019.813418-1-tgrabiec@scylladb.com>
(cherry picked from commit 4d627affc3)
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.
Fix by restoring the "remaining" count when adjusting the paging state
on short read.
Tests:
- index_with_paging_test
- secondary_index_test
Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
(cherry picked from commit 1e4da2dcce)
shared_promise::get_shared_future() is marked noexcept, but can
allocate memory. It is invoked by sstable partition index cache inside
an allocating section, which means that the allocation can throw
bad_alloc even though there is memory to reclaim, i.e. under normal
conditions.
Fix by allocating the shared_promise in stable memory, in the
standard allocator via lw_shared_ptr<>, so that it can be accessed outside
the allocating section.
Fixes #9666
Tests:
- build/dev/test/boost/sstable_partition_index_cache_test
Message-Id: <20211122165100.1606854-1-tgrabiec@scylladb.com>
(cherry picked from commit 1d84bc6c3b)
"
When the gossiper processes its messages in the background, some of
the continuations may pop up after the gossiper is shut down.
This, in turn, may result in unwanted code being executed when
it isn't expected.
In particular, storage_service notification hooks may try to
update the system keyspace (with "fresh" peer info/state/tokens/etc).
This update doesn't work after drain because drain shuts down the
commitlog. The intention was that the gossiper did _not_ notify
anyone after drain, because it's shut down during drain too.
But since there are background continuations left, it's not
working as expected.
refs: #9567
tests: unit(dev), dtest.concurrent_schema_changes.snapshot(dev)
"
* 'br-gossiper-background-messages-2' of https://github.com/xemul/scylla:
gossiper: Guard background processing with gate
gossiper: Helper for background messaging processing
(cherry picked from commit 9e2b6176a2)
In scylla_util.py, we provide `systemd_unit.is_active()` to return `systemctl is-active` output.
When we introduced the systemd_unit class, we just returned the `systemctl is-active` output as a string, but we later changed the return value to bool (2545d7fd43).
This was because `if unit.is_active():` always evaluated to True even when it returned "failed" or "inactive"; the change was meant to avoid such scripting bugs.
However, this was probably a mistake.
Systemd unit state is not a two-state "start"/"stop" value; there are many states.
And we are already using multiple unit states ("activating", "failed", "inactive", "active") in our Cloud image login prompt:
https://github.com/scylladb/scylla-machine-image/blob/next/common/scylla_login#L135
After we merged 2545d7fd43, the login prompt broke, because it no longer returns the string the script expected (https://github.com/scylladb/scylla-machine-image/issues/241).
I think we should revert 2545d7fd43; it should return exactly the same value as `systemctl is-active` reports.
Fixes #9627
Fixes scylladb/scylla-machine-image#241
Closes #9628
* github.com:scylladb/scylla:
scylla_ntp_setup: use string in systemd_unit.is_active()
Revert "scylla_util.py: return bool value on systemd_unit.is_active()"
(cherry picked from commit c17101604f)
There's at least one tiny race in the generic_server code. The trailing
.handle_exception after conn->process() captures `this`, but since the
whole continuation chain runs in the background, the server can be
released in the meantime, causing the lambda to execute on a freed
generic_server instance. This, in turn, is not nice because the captured
`this` is used to get the _logger from.
The fix is based on the observation that all connections pin the server
in memory until all of them (the connections) are destructed. That said,
to keep the server alive in the aforementioned lambda it's enough to make
sure the conn variable (an lw_shared_ptr to the connection) is alive in
it. To avoid generating a bunch of tiny continuations with an identical
set of captures, tail the chain with a single .then_wrapped() and do
whatever is needed to wrap up the connection processing in it.
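The underlying idea can be sketched with plain std::shared_ptr (a stand-in for lw_shared_ptr; the "continuation" here is just a deferred callable): capturing the connection handle by value in the lambda keeps the connection, and everything it pins, alive until the lambda is done.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Toy model: background continuations are deferred callables we run later.
struct connection {
    std::string name;
};

std::vector<std::function<std::string()>> pending;

// Capturing the shared_ptr by value pins the connection for as long as
// the background continuation exists, even if the caller drops its own
// reference first.
void process_in_background(std::shared_ptr<connection> conn) {
    pending.push_back([conn] { return conn->name; });  // conn kept alive here
}
```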
tests: unit(dev)
fixes: #9316
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211115105818.11348-1-xemul@scylladb.com>
(cherry picked from commit ba16318457)
When building a docker image we rely on the `VERSION` value from
`SCYLLA-VERSION-GEN`. For `rc` releases only, there is a difference
between the configured version (X.X.rcX) and the actual Debian package
we generate (X.X~rcX).
Using a similar solution as I did in dcb10374a5.
Fixes: #9616
Closes #9617
(cherry picked from commit 060a91431d)
clang evaluates function arguments from left to right, while gcc does so
in reverse. Therefore, this code can be correct on clang and incorrect
on gcc:
```
f(x.sth(), std::move(x))
```
This patch fixes one such instance of this bug, in memtable.cc.
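A minimal sketch of the pitfall and the portable fix (the types and names are made up for illustration): reading a value from an object in one argument while moving the same object in another is indeterminately sequenced, so the read must be hoisted out before the call.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct payload {
    std::string data;
    size_t size() const { return data.size(); }
};

size_t consume(size_t n, payload p) {
    (void)p;  // takes ownership of the payload
    return n;
}

// BUGGY (unspecified result): consume(x.size(), std::move(x));
// the parameter's move-construction may empty x before x.size() runs.
//
// Safe: sequence the read explicitly before the move.
size_t consume_safely(payload x) {
    size_t n = x.size();              // read first...
    return consume(n, std::move(x));  // ...then move
}
```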
Fixes #9605.
Closes #9606
(cherry picked from commit eff392073c)
Due to an error in transforming the above routine, readers which have <= a
buffer's worth of content are dropped without being consumed.
This is because the outer consume loop is conditioned on
`is_end_of_stream()`, which will be set for readers that eagerly
pre-fill their buffer and have no more data than what is in their
buffer.
Change the condition to also check `is_buffer_empty()` and only drop
the reader if both of these are true.
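A toy model of the corrected drain condition (made-up reader type that eagerly pre-fills its whole buffer up front):

```cpp
#include <cassert>
#include <deque>
#include <vector>

// A toy reader that eagerly pre-fills its entire buffer: for a small
// input, is_end_of_stream() is already true while the buffer still
// holds all the data.
struct toy_reader {
    std::deque<int> buffer;
    bool is_end_of_stream() const { return true; }  // eagerly pre-filled
    bool is_buffer_empty() const { return buffer.empty(); }
    int pop() {
        int v = buffer.front();
        buffer.pop_front();
        return v;
    }
};

// Correct drain loop: stop only when the stream ended AND the buffer is
// empty. Looping on is_end_of_stream() alone would consume nothing here.
std::vector<int> drain(toy_reader& r) {
    std::vector<int> out;
    while (!(r.is_end_of_stream() && r.is_buffer_empty())) {
        out.push_back(r.pop());
    }
    return out;
}
```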
Fixes: #9594
Tests: unit(mutation_writer_test --repeat=200, dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211108092923.104504-1-bdenes@scylladb.com>
(cherry picked from commit 4b6c0fe592)
Refs #9331
In segment::close() we add space to the manager's "wasted" counter. In the
destructor, if we can cleanly delete/recycle the file, we remove it. However,
if we never went through close() (shutdown - ok; exception in batch_cycle -
not ok), we can end up subtracting numbers that were never added in the
first place.
Just keep track of the bytes added in a variable.
The observed behaviour in the above issue is timeouts in batch_cycle, where
we declare the segment closed early (because we cannot safely add anything
more - chunks could get partial/misplaced). The exception will propagate to
the caller(s), but the segment will not go through an actual close() call,
so the destructor should not assume it did.
Closes #9598
(cherry picked from commit 3929b7da1f)
The schema has a private constructor, which means it can't be
constructed with `make_lw_shared()` even by classes which are otherwise
able to invoke the private constructor themselves.
This results in such classes (`schema_builder`) resorting to building a
local schema object, then invoking `make_lw_shared()` with the schema's
public move constructor. Moving a schema is not cheap at all however, so
each `schema_builder::build()` call results in two expensive schema
construction operations.
We could make `make_lw_shared()` a friend of `schema` to resolve this,
but then we'd de-facto open the private constructor to the world.
Instead this patch introduces a private tag type, which is added to the
private constructor, which is then made public. Everybody can invoke the
constructor but only friends can create the private tag instance
required to actually call it.
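The tag idiom looks roughly like this (a sketch using std::shared_ptr/std::make_shared in place of lw_shared_ptr/make_lw_shared):

```cpp
#include <cassert>
#include <memory>

class schema_builder;

class schema {
    struct private_tag {};  // only schema's friends can construct this
    friend class schema_builder;
public:
    // Public constructor, but the tag argument keeps it effectively
    // private: std::make_shared can call it, strangers cannot.
    explicit schema(private_tag) {}
};

class schema_builder {
public:
    std::shared_ptr<schema> build() const {
        // Constructs the schema in place - no extra move of the schema.
        return std::make_shared<schema>(schema::private_tag{});
    }
};
```

The factory function constructs the object directly in its final location, so the expensive move of a locally built schema disappears.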
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211105085940.359708-1-bdenes@scylladb.com>
This PR started from realizing that in the memtable reversing reader, it
never happened in tests that `do_refresh_state` was called with
`last_row` and `last_rts` which are not `std::nullopt`.
Changes:
- fix the memtable test (`test_memtable_with_many_versions_conforms_to_mutation_source`), so that there is a background job forcing state refreshes,
- fix the way rt_slice is computed (was `(last_rts, cr_range_snapshot.end]`, now is `[cr_range_snapshot.start, last_rts)`).
Fixes #9486
Closes #9572
* github.com:scylladb/scylla:
partition_snapshot_reader: fix indentation in fill_buffer
range_tombstone_list: {lower,upper,}slice share comparator implementation
test: memtable: add full_compaction in background
partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set
range_tombstone_list: add lower_slice
slice (2 overloads), upper_slice, and lower_slice previously each had
their own implementation of a comparator. Move out the common structs,
so that all 4 of them can share the implementation.
compile_commands.json (a.k.a. "compdb",
https://clang.llvm.org/docs/JSONCompilationDatabase.html) is intended
to help stand-alone C-family LSP servers index the codebase as
precisely as possible.
The actively maintained LSP servers with good C++ support are:
- Clangd (https://clangd.llvm.org/)
- CCLS (https://github.com/MaskRay/ccls)
This change causes a successful invocation of configure.py to create a
unified Scylla+Seastar+Abseil compdb for every selected build mode,
and to leave a valid symlink in the source root (if a valid symlink
already exists, it will be left alone).
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #9558
The rjson::set() *sounds* like it can set any member of a JSON object
(i.e., map), but that's not true :-( It calls the RapidJson function
AddMember() so it can only add a member to an object which doesn't have
a member with the same name (i.e., key). If it is called with a key
that already has a value, the result may have two values for the same
key, which is ill-formed and can cause bugs like issue #9542.
So in this patch we begin by renaming rjson::set() and its variant to
rjson::add() - to suggest to its user that this function only adds
members, without checking if they already exist.
After this rename, I was left with dozens of calls to the set() functions
that needed to be changed to either add() - if we're sure that the object
cannot already have a member with the same name - or to replace() if
it might.
The vast majority of the set() calls were starting with an empty item
and adding members with fixed (string constant) names, so these can
be trivially changed to add().
It turns out that *all* other set() calls - except the one fixed in
issue #9542 - can also use add() because there are various "excuses"
why we know the member names will be unique. A typical example is
a map with column-name keys, where we know that the column names
are unique. I added comments in front of such non-obvious uses of
add() which are safe.
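The distinction can be sketched with a toy object that, like RapidJSON's storage, is just an ordered member list and happily holds duplicate keys (this is an illustration, not the real rjson code):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy JSON object mirroring RapidJSON's storage: an ordered list of
// members, which can end up holding duplicate keys.
using json_object = std::vector<std::pair<std::string, std::string>>;

// add(): like AddMember() - blindly appends. Calling it for a key that
// already exists produces an ill-formed object with two values.
void add(json_object& o, std::string key, std::string value) {
    o.emplace_back(std::move(key), std::move(value));
}

// replace(): overwrite the member if it exists, otherwise append.
void replace(json_object& o, const std::string& key, std::string value) {
    for (auto& m : o) {
        if (m.first == key) {
            m.second = std::move(value);
            return;
        }
    }
    o.emplace_back(key, std::move(value));
}
```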
Almost all uses of rjson except a handful are in Alternator, so I
verified that all Alternator test cases continue to pass after this
patch.
Fixes #9583
Refs #9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104152540.48900-1-nyh@scylladb.com>
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.
The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.
So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.
In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().
Fixes #9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source
in background. Without it, some paths in the partition snapshot reader
weren't covered, as the tests always managed to read all range
tombstones and rows which cover a given clustering range from just a
single snapshot. Now, when full_compaction happens in process of reading
from a clustering range, we can force state refresh with non-nullopt
positions of last row and last range tombstone.
Note: this inability to test affected only the reversing reader.
If Reversing and _last_rts was set, the created rt_slice still contained
range tombstones between *_last_rts and the (snapshot) clustering range
end. This is wrong - the correct range is between the (snapshot)
clustering range begin and *_last_rts.
Cleanup and improvements for compaction
* 'compaction_cleanup_and_improvements_v2' of https://github.com/raphaelsc/scylla:
compaction: fix outdated doc of compact_sstables()
table: fix indentation in compact_sstables()
table: give a more descriptive name to compaction_data in compact_sstables()
compaction_manager: rename submit_major_compaction to perform_major_compaction
compaction: fix indentantion in compaction.hh
compaction: move incremental_owned_ranges_checker into cleanup_compaction
compaction: make owned ranges const in cleanup_compaction
compaction: replace outdated comment in regular_compaction
compaction: give a more descriptive name to compaction_data
compaction_manager: simplify creation of compaction_data
For symmetry, let's call it perform_*, as it doesn't work like the
submission functions, which don't wait for the result, like the one for
minor compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
info is no longer descriptive, as compaction now works with
compaction_data instead of compaction_info.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There's no need to wrap compaction_data in a shared_ptr; also, let's
kill the unused params in create_compaction_data to simplify its
creation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The GKE metadata server does not provide the same metadata as GCE, so we
should not return True in is_gce().
So try to fetch the machine-type from the metadata server, and return
False if it returns 404 Not Found.
Fixes #9471
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #9582
"
Some time ago the --parallel-cases option was introduced,
set to False by default. Now everything is ready for
making it True.
Running in a BYO job shows that it takes 30 minutes less to
complete the debug tests. Other timings remain almost the same.
tests: unit(dev), unit(debug)
"
* 'br-parallel-cases-by-default' of https://github.com/xemul/scylla:
test.py: Run parallel cases by default
test, raft: Keep many-400 case out of debug mode
test.py: Cache collected test-cases
There were a few missing bits before making this the default:
- default max number of AIOs - now tests are run with a greatly
  reduced value
- a 1.5-hour single case from database_test - now it's split and
  scales with --parallel-cases
- suite add_test methods called in a loop for --repeat options -
  patch #1 from this set fixes it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This case takes 45+ minutes, which is 1.5 times longer than the
second longest case out there. I propose keeping the many-400
case out of debug runs; there's a many-100 case which is configured
the same way but uses 4x fewer "nodes".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The add_test method of a suite can be called several times in a
row, e.g. in case of the --repeat option or because there is more
than one custom_args entry in the suite.yaml file. In any case
it's pointless to re-collect the test cases by launching the
test binary again; it's much faster (and 100% safe) to keep the
list of cases from the previous call and re-use it if the test
name matches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid).
Reset the uid/gid information so this doesn't happen.
Closes #9579
Compaction efficiency can be defined as how much backlog is reduced per
byte read or written.
We know a few facts about efficiency:
1) the more files are compacted together (the fan-in) the higher the
efficiency will be, however...
2) the bigger the size difference of input files the worse the
efficiency, i.e. higher write amplification.
So compactions with similar-sized files are the most efficient ones,
and their efficiency increases with a higher number of files.
However, in order to not have bad read amplification, number of files
cannot grow out of bounds. So we have to allow parallel compaction
on different tiers, but to avoid "dilution" of overall efficiency,
we will only allow a compaction to proceed if its efficiency is
greater than or equal to the efficiency of ongoing compactions.
For the time being, we'll assume that strategies don't pick candidates
with wildly different sizes, so efficiency is calculated only as a
function of compaction fan-in.
Now when the system is under heavy load, the fan-in threshold will
automatically grow to guarantee that overall efficiency remains
stable.
Please note that fan-in is defined in number of runs. LCS compaction
on higher levels will have a fan-in of 2. Under heavy load, it
may happen that LCS will temporarily switch to size-tiered mode
for compaction to keep up with the amount of data being produced.
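The admission rule above can be sketched roughly as follows (names are hypothetical; one plausible reading is that the threshold is the highest fan-in among the ongoing compactions, so the bar rises under load):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Efficiency is approximated by fan-in (number of input runs). A new
// compaction is admitted only if it is at least as efficient as the
// ongoing compactions, so running many in parallel cannot dilute the
// overall efficiency.
bool admit_compaction(int candidate_fan_in, const std::vector<int>& ongoing_fan_ins) {
    if (ongoing_fan_ins.empty()) {
        return true;  // nothing running, anything is an improvement
    }
    int threshold = *std::max_element(ongoing_fan_ins.begin(), ongoing_fan_ins.end());
    return candidate_fan_in >= threshold;
}
```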
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103215110.135633-2-raphaelsc@scylladb.com>
STCS considers the smallest bucket, out of the ones which contain
more than min_threshold elements, to be the most interesting one
to compact now.
That's basically saying we'll only compact larger tiers once we're
done with smaller ones. That can be problematic because under heavy
load, larger tiers cannot be compacted in a timely manner even though
they're the ones contributing the most to read amplification.
For example, if we're producing sstables in smaller tiers at roughly
the same rate that we can compact them, then it may happen that
larger tiers will not be compacted even though new sstables are
being pushed to them. Therefore, backlog will not be reduced in a
satisfactory manner, so read latency is affected.
By picking the bucket with the largest fan-in instead, we'll choose the
most efficient compaction, as we'll target the buckets which reduce
the backlog the most once compacted.
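A toy sketch of the new candidate selection (hypothetical names; sstable sizes elided): pick the bucket with the largest fan-in among those meeting min_threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using bucket = std::vector<int>;  // sstable sizes, say

// Instead of the smallest eligible bucket, pick the one with the most
// sstables (largest fan-in): compacting it reduces the backlog most.
// Returns -1 when no bucket reaches min_threshold.
int most_interesting_bucket(const std::vector<bucket>& buckets, size_t min_threshold) {
    int best = -1;
    size_t best_fan_in = 0;
    for (size_t i = 0; i < buckets.size(); ++i) {
        size_t fan_in = buckets[i].size();
        if (fan_in >= min_threshold && fan_in > best_fan_in) {
            best = static_cast<int>(i);
            best_fan_in = fan_in;
        }
    }
    return best;
}
```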
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103215110.135633-1-raphaelsc@scylladb.com>
If a memtable flush is segregated into multiple files, the partition
estimate becomes inaccurate and consequently bloom filters are
bigger than needed, leading to an increase in memory consumption.
To fix this, let's wire adjust_partition_estimate() into the flush
procedure, such that original estimation will be adjusted if
segregation is going to be performed. That's done by feeding
mutation_source_metadata, which will leave original estimation
unchanged if no segregation is needed, but will adjust it
otherwise.
Fixes #9581.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103141600.65806-2-raphaelsc@scylladb.com>
Without tweaking the interface, there was no way to adjust estimated
partitions on flush. For example, when segregating a memtable for
TWCS, all produced sstables would have an estimation equal to
the memtable size, even though each only contains a subset of it,
which leads to a significant increase in memory consumption for
bloom filters. Subsequent work will use this interface to perform
the adjustment.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103141600.65806-1-raphaelsc@scylladb.com>
"
There are 3 overlapping problems with the test case. It has a
use-after-move that covers wrong window selection, and it relies
on the time-since-epoch being aligned with the time window by
chance.
tests: unit(dev)
"
* 'br-twcs-test-fixes' of https://github.com/xemul/scylla:
test, compaction: Do not rely on random timestamp
test, compaction: Fix use after move in twcs reshape
There are two APIs for checking the repair status and they behave
differently in case the id is not found.
```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```
The correct status code is 400 as this is a parameter error and should
not be retried.
Returning status code 500 makes smarter HTTP clients retry the request
in hopes of the server recovering.
After this patch:
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}
Fixes #9576
Closes #9578
Again, there's a sub-case with sequential timestamps that still
works by chance. This time it's because splitting 256 sstables
into buckets of at most 8 is allowed to leave the 1st and the
last buckets with fewer than 8 items, e.g. 3, 8, ..., 8, 5. The
exact generation depends on the time-since-epoch at which it
starts.
When all the cases are run together, this time luckily happens
to be well-aligned with 8 hours and the generated buckets are
filled perfectly. When this particular test case is run all alone
(e.g. by --run_test or --parallel-cases), the starting time
becomes different and it gets fewer than 4 sstables in its first
bucket.
The fix is to adjust the starting time to be aligned with the
8-hour window.
Actually, the 8 hours appeared in the previous patch, before which
it was 24 hours. Nonetheless, the above reasoning applies to any
size of the time window that's less than 256, so it's still an
independent fix.
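The alignment fix boils down to snapping the starting timestamp down to a window boundary; a minimal sketch (the 8-hour window is just the value used in this test):

```cpp
#include <cassert>
#include <cstdint>

// Align a time-since-epoch value down to the start of its time window
// (e.g. an 8-hour TWCS window, 28800 seconds), so the generated
// sstables fill buckets deterministically regardless of when the test
// starts.
constexpr int64_t align_to_window(int64_t ts_seconds, int64_t window_seconds) {
    return ts_seconds - ts_seconds % window_seconds;
}
```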
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The options are std::move-d twice - first into the schema builder, then
into the compaction strategy. Surprisingly, the 2nd move makes the
test work.
There's a sub-case in this case that checks sstables with incremental
timestamps with a 1-hour step - 0h, 1h, 2h, ... 255h. Next, the twcs
bucket generator obeys a minimal threshold of 4 sstables per bucket.
Buckets with fewer sstables are not included in the job. Finally,
since the options used to create the twcs are empty after the 1st
move, the default window of 24 hours is used. If they were configured
correctly with a 1-hour window then all buckets would contain 1 sstable
and the generated job would be empty.
So the fix is both: don't move after move, and make the window size
large enough to fit more sstables than the mentioned minimum.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In UpdateItem's AttributeUpdates (old-style parameter) we were missing
support for the ADD operation - which can increment a number, or add
items to sets (or to lists, even though this fact isn't documented).
This two-patch series adds this missing feature. The first patch just moves
an existing function to where we can reuse it, and the second patch is
the actual implementation of the feature (and enabling its test).
Fixes #5893
Closes #9574
* github.com:scylladb/scylla:
alternator: add support for AttributeUpdates ADD operation
alternator: move list_concatenate() function
Currently, scylla-server fails to start on ARM instances because scylla_io_setup does not have preset parameters even for instance types added to the 'supported instances' list.
To fix this, we need to add io parameter presets to scylla_io_setup.
Also, we mistakenly added EBS-only instances in a004b1da30; these need to be removed.
Instances that do not have an ephemeral disk should be 'unsupported instances': we still run our AMI on them, but we print a warning message on the login prompt, and the user is required to run scylla_io_setup.
Fixes #9493
Closes #9532
* github.com:scylladb/scylla:
scylla_util.py: remove EBS only ARM instances from support instance list
scylla_io_setup: support ARM instances on AWS
In UpdateItem's AttributeUpdates (old-style parameter) we were missing
support for the ADD operation - which can increment a number, or add
items to sets (or to lists, even though this fact isn't documented).
This patch adds this feature, and the test for it begins to pass so its
"xfail" marker is removed.
Fixes #5893
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The list_concatenate() function was only used for UpdateExpression's
ADD operation, so we made it a static function in the source file where
it was used. In the next patch, we'll want to use it in another place
(AttributeUpdates' ADD operation), so let's move it to the same file
where similar functions for sets exist.
This patch is almost entirely a code move, but also makes one small
change: list_concatenate() used to throw an exception if one of the
arguments wasn't a list, but the text of this exception was specific to
UpdateExpression. So in the new version, we return a null value in this
case - and the caller checks for it and throws the right exception.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.
At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.
In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).
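A rough sketch of the tolerant splitting described above (this is not the actual Alternator parser; the names are made up): any run of commas and/or spaces counts as one separator.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split the Authorization header's component list tolerantly: DynamoDB
// accepts "a, b", "a,b" and "a b" alike, so treat any run of commas
// and/or spaces as a single separator.
std::vector<std::string> split_components(const std::string& s) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : s) {
        if (c == ',' || c == ' ') {
            if (!cur.empty()) {
                out.push_back(cur);
                cur.clear();
            }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) {
        out.push_back(cur);
    }
    return out;
}
```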
Fixes #9568
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
Since we require ephemeral disks for our AMI, these EBS-only ARM
instances cannot be added to the 'supported instances' list.
We are still able to run our AMI on these instance types, but the login
message warns that it is an 'unsupported instance type', and requires
running scylla_io_setup manually.
* seastar 083898a172...a189cdc45d (7):
> print: deprecate print() family
> treewide: replace uses of fmt_print() and fprint() with direct fmt calls
> circular_buffer: mark clear noexcept
> circular_buffer: mark trivial methods noexcept
> Merge "file: allow destroying append_challenged_posix_file_impl following a close failure" from Benny
> merge: Add parsing HTTP response status
> inet_address: fix usage of `htonl` for clang
scrub_compaction assumes that `make_interposer_consumer()` is called
only when `use_interposer_consumer()` returns true. This is not the
case, so in effect scrub always ends up using the segregating
interposer. Fix this by short-circuiting the former method when the
latter returns false, returning the passed-in consumer unchanged.
Tests: unit(dev)
Fixes #9541
Closes #9564
The current disk-based segregation method works well enough for most cases, but it struggles greatly when there are a lot of partitions in the input. When this is the case it will produce tons of buckets (sstables), in the order of hundreds or even thousands. This puts a huge strain on different parts of the system.
This series introduces a new segregation method which specializes on the lots of small partitions case. If the conditions are right, it can cause a drastic reduction of buckets. In one case I tested, a 1.1GB sstable with 3.6M partitions in it produced just 2 output sstables, down from the 500+ with the on-disk method.
This new method uses a memtable to sort out-of-order partitions. In-order partitions bypass this sorting altogether and go to the disk directly. This method is not suitable for cases where either the partitions are large or the total amount of data is large. For those the disk-based method should be used. Scrub compaction decides on the method to use based on heuristics.
Tests: unit(dev)
Closes#9548
* github.com:scylladb/scylla:
compaction: scrub_compaction: add bucket count to finish message
test/boost: mutation_writer_test: harden the partition-based segregator test
mutation_writer: remove now unused on-disk partition segregator
compaction,test: use the new in-memory segregator for scrub
mutation_writer/partition_based_splitting_writer: add memtable-based segregator
We introduce a new operation to the framework: `reconfiguration`.
The operation sends a reconfiguration request to a Raft cluster. It
bounces a few times in case of `not_a_leader` results.
A side effect of the operation is modifying a `known` set of nodes which
the operation's state has a reference to. This `known` set can then be
used by other operations (such as `raft_call`s) to find the current
leader.
For now we assume that reconfigurations are performed sequentially. If a
reconfiguration succeeds, we change `known` to the new configuration. If
it fails, we change `known` to be the union of the previous
configuration and the current configuration (because we don't know what
the configuration will eventually be - the old or the attempted one - so
any member of the union may eventually become a leader).
We use a dedicated thread (similarly to the network partitioning thread)
to periodically perform random reconfigurations.
* kbr/reconfig-v2:
test: raft: randomized_nemesis_test: perform reconfigurations in basic_generator_test
test: raft: randomized_nemesis_test: improve the bouncing algorithm
test: raft: randomized_nemesis_test: handle more error types
test: raft: randomized_nemesis_test put `variant` and `monostate` `ostream` `operator<<`s into `std` namespace
test: raft: randomized_nemesis_test: `reconfiguration` operation
It is useful to know how many buckets (output sstables) scrub produced
in total. The end compaction message will only report those still open
when the scrub finished, but will omit those that were closed in the
middle.
Test both methods: the "old" disk-based one and the recently added
in-memory one, with different configurations, and also add additional
checks to ensure they don't lose data.
The current method of segregating partitions doesn't work well for huge
numbers of small partitions. For especially bad input, it can produce
hundreds or even thousands of buckets. This patch adds a new segregator
specialized for this use-case. This segregator uses a memtable to sort
out-of-order partitions in-memory. When the memtable size reaches the
provided max-memory limit, it is flushed to disk and a new empty one is
created. In-order partitions bypass the sorting altogether and go to the
fast-path bucket.
The new method is not used yet, this will come in the next patch.
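The buffering scheme above can be sketched with stand-in types: a `std::map` plays the memtable, and `segregator`, `fast_path`, and `flushed` are illustrative names, not Scylla's classes:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// In-order partitions stream straight to the fast-path bucket; out-of-order
// ones are buffered in a sorted "memtable" that is flushed once it reaches
// the configured size limit.
struct segregator {
    std::vector<std::string> fast_path;              // in-order stream
    std::map<std::string, int> memtable;             // sorts out-of-order keys
    std::vector<std::map<std::string, int>> flushed; // "on-disk" flushed memtables
    std::string last_key;
    size_t max_memtable_entries;

    explicit segregator(size_t limit) : max_memtable_entries(limit) {}

    void consume(const std::string& key) {
        if (last_key.empty() || key >= last_key) {
            last_key = key;
            fast_path.push_back(key);                // bypass sorting entirely
        } else {
            memtable[key]++;                         // out of order: buffer, sorted
            if (memtable.size() >= max_memtable_entries) {
                flushed.push_back(std::move(memtable)); // "flush to disk"
                memtable.clear();                    // start a new empty memtable
            }
        }
    }
};
```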
range::subtract() was used as a trick to implement range::intersection()
back when intersection was not yet available.
The following code is problematic:
dht::token_range given_range_complement(tok_end, tok_start);
because dht::token_range should be non-wrapping range.
To fix, use compat::unwrap_into() to unwrap the range and use
the range::intersection() to calculate the intersection.
Note that even though given_range_complement is problematic, the current
code generates correct intersections.
Example 1:
$ curl -X POST 'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=5&endToken=100'
[shard 0] repair - starting user-requested
repair for keyspace keyspace1, repair id [id=1,
uuid=aa71a192-5967-4f05-99b8-5febd9d81d50], options {{ endToken -> 100}, { startToken -> 5}}
[shard 0] repair - start=5, end=100, given_range_complement=(100, 5], wrange=(100, 5], is_wrap_around=true,
ranges={(-inf, -7612759882658906007], (-7612759882658906007,
-6766703710995023384], (-6766703710995023384, 2918449800065200439],
(2918449800065200439, 8039072586540417979], (8039072586540417979,
+inf)}, intersections={(5, 100]}
[shard 0] repair - New method intersections={(5, 100]}
Example 2:
$ curl -X POST
'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=100&endToken=5'
[shard 0] repair - starting user-requested
repair for keyspace keyspace1, repair id [id=1,
uuid=f6076438-015c-4bdc-8ebd-0a55664365fa], options {{ endToken -> 5}, { startToken -> 100}}
[shard 0] repair - start=100, end=5, given_range_complement=(5, 100], wrange=(5, 100], is_wrap_around=false,
ranges={(-inf, -7612759882658906007], (-7612759882658906007,
-6766703710995023384], (-6766703710995023384, 2918449800065200439],
(2918449800065200439, 8039072586540417979], (8039072586540417979,
+inf)}, intersections={
(-inf, -7612759882658906007],
(-7612759882658906007, -6766703710995023384],
(-6766703710995023384, 5],
(100, 2918449800065200439],
(2918449800065200439, 8039072586540417979],
(8039072586540417979, +inf)}
[shard 0] repair - New method intersections={
(-inf, -7612759882658906007],
(-7612759882658906007, -6766703710995023384],
(-6766703710995023384, 5],
(100, 2918449800065200439],
(2918449800065200439, 8039072586540417979],
(8039072586540417979, +inf)}
Fixes #9560
Closes #9561
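A minimal sketch of the intersection-based approach over plain integer token ranges; `range_t`, `intersection`, and `unwrap` are illustrative names (the real code uses dht::token_range and compat::unwrap_into()):

```cpp
#include <algorithm>
#include <optional>
#include <utility>
#include <vector>

// Inclusive, non-wrapping range: first <= second.
using range_t = std::pair<long, long>;

std::optional<range_t> intersection(range_t a, range_t b) {
    long lo = std::max(a.first, b.first);
    long hi = std::min(a.second, b.second);
    if (lo > hi) {
        return std::nullopt;
    }
    return range_t{lo, hi};
}

// A wrapping request (start > end) is first unwrapped into two non-wrapping
// pieces covering the bottom and top of the token space.
std::vector<range_t> unwrap(long start, long end, long min_tok, long max_tok) {
    if (start <= end) {
        return {{start, end}};
    }
    return {{min_tok, end}, {start, max_tok}};
}
```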
The two cql-pytest tests:
test_frozen_collection.py::test_wrong_set_order_in_nested
test_frozen_collection.py::test_wrong_set_order_in_nested_2
which used to fail, and therefore were marked "xfail", got fixed by commit
5589f348e7 ("cql3: expr: Implement
evaluate(expr::bind_variable)"). That commit made the handling of bound
variables in prepared statements more rigorous, and in particular
made sure that sets are re-sorted not only if they are at the top
level of the value (as happened in the old code), but also if they
are nested inside some other container. This explains the surprising
fact that we could only reproduce the bug with prepared statements, and
only with nested sets - while top-level sets worked correctly.
As the tests no longer failed and the bug tested by them really did
get fixed, in this patch we remove the "xfail" marker from these
tests.
Closes#7856. This issue was really fixed by the aforementioned commit,
but let's close it now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211029221731.1113554-1-nyh@scylladb.com>
We shouldn't be using Seastar as a text formatting library; that's
not its focus. Use fmt directly instead. fmt::print() doesn't return
the output stream which is a minor inconvenience, but that's life.
Closes#9556
The partition based splitting writer (used by scrub) was found to be exception-unsafe, converting an `std::bad_alloc` into an assert failure. This series fixes the problem and adds a unit test checking the exception safety against `std::bad_alloc`, fixing any other related problems found.
Fixes: https://github.com/scylladb/scylla/issues/9452
Closes #9453
* github.com:scylladb/scylla:
test: mutation_writer_test: add exception safety test for segregate_by_partition()
mutation_writer: segregate_by_partition(): make exception safe
mutation_reader: queue_reader_handle: make abandoned() exception safe
mutation_writer: feed_writers(): make it a coroutine
mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement
It's much more efficient to have a separate compaction task that consists
entirely of expired sstables and make sure it gets a unique "weight", than
to mix expired sstables with non-expired sstables, adding an unpredictable
latency to the eviction of an expired sstable.
This change also improves the visibility of eviction events because now
they are always going to appear in the log as compactions that compact into
an empty set.
Fixes#9533
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Closes#9534
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.
Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).
Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries need to differentiate between the two
reply formats, Alternator better return the right one.
This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.
Fixes#9554.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
Setting a value of "warn" will still allow the create or alter commands,
but will warn the user, with a message that will appear both at the log
and also at cqlsh for example.
This is another step towards deprecating DTCS. Users need to know we're
moving towards this direction, and setting the default value to warn
is needed for this. Next step is to set it to false, and finally remove
it from the code base.
Refs #8914.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211029184503.102936-1-raphaelsc@scylladb.com>
Objects of type "api_error" are used in Alternator when throwing an
error which will be reported as-is to the user as part of the official
DynamoDB protocol.
Although api_error objects are often thrown, the api_error class was not
derived from std::exception, because that's not necessary in C++.
However, it is *useful* for this exception to derive from std::exception,
so this is what this patch does. It is useful for api_error to inherit
from std::exception because then our logging and debugging code knows
how to print this exception with all its details. All we need to do is
to implement a what() virtual function for api_error.
Before this patch, logging an api_error just logs the type's name
(i.e., the string "api_error"). After this patch, we get the full
information stored in the api_error - the error's type and its message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211017150555.225464-1-nyh@scylladb.com>
sprint() uses the printf-style formatting language while most of our
code uses the Python-derived format language from fmt::format().
The last mass conversion of sprint() to fmt (in 1129134a4a)
missed some callers (principally those that were on multiple lines, and
so the automatic converter missed them). Convert the remainder to
fmt::format(), and some sprintf() and printf() calls, so we have just
one format language in the code base. seastar::sprint() ought to be
deprecated and removed.
Test: unit (dev)
Closes#9529
* github.com:scylladb/scylla:
utils: logalloc: convert debug printf to fmt::print()
utils: convert fmt::fprintf() to fmt::print()
main: convert fprint() to fmt::print()
compress: convert fmt::sprintf() to fmt::format()
tracing: replace seastar::sprint() with fmt::format()
thrift: replace seastar::sprint() with fmt::format()
test: replace seastar::sprint() with fmt::format()
streaming: replace seastar::sprint() with fmt::format()
storage_service: replace seastar::sprint() with fmt::format()
repair: replace seastar::sprint() with fmt::format()
redis: replace seastar::sprint() with fmt::format()
locator: replace seastar::sprint() with fmt::format()
db: replace seastar::sprint() with fmt::format()
cql3: replace seastar::sprint() with fmt::format()
cdc: replace seastar::sprint() with fmt::format()
auth: replace seastar::sprint() with fmt::format()
Alternator auth module used to piggy-back on top of CQL query processor
to retrieve authentication data, but it's no longer the case.
Instead, storage proxy is used directly.
Closes#9538
Convert storage_service::describe_ring to a coroutine
to prevent reactor stalls as seen in #9280.
Fixes #9280
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#9282
With 062436829c,
we return all input sstables in strict mode
if they are disjoint even if they don't need reshaping at all.
This leads to an infinite reshape loop when uploading sstables
with TWCS.
The optimization for disjoint sstables is worth it
also in relaxed mode, so this change first makes sorting of the input
sstables by first_key order independent of reshape_mode,
and then it adds a check for sstable_set_overlapping_count
before trimming either the multi_window vector or
any single_window bucket such that we don't trim the list
if the candidates are disjoint.
Adjust twcs_reshape_with_disjoint_set_test accordingly.
And also add some debug logging in
time_window_compaction_strategy::get_reshaping_job
so one can figure out what's going on there.
Test: unit(dev)
DTest: cdc_snapshot_operation.py:CDCSnapshotOperationTest.test_create_snapshot_with_collection_list_with_base_rows_delete_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211025071828.509082-1-bhalevy@scylladb.com>
The cdc stream rewrite code launches work in the background. It's careful
to make sure all the dependencies are preserved via shared pointers, but
there are implicit dependencies (system_keyspace) that are not. To fix this
problem, this series changes the lifetime of the background work from
unbounded to be bounded by the lifetime of storage_service.
[ xemul: Not the storage_service, but the cdc_generation_service, now
all the dances with rewriting and its futures happen inside the cdc
part. Storage_service coroutinization and call to .stop() from main
are here for several reasons
- They are awesome
- Splitting a PR into two is frustrating (as compared to ML threads)
- Shia LaBeouf is a good motivational speaker
]
As explained in the patches, I don't expect this to be a real problem and the
series just paves the way for making system_keyspace an explicit
dependency.
Test: unit (dev)
Closes#9356
* github.com:scylladb/scylla:
cdc: don't allow background streams description rewrite to delay too far
storage_service: coroutinize stop()
main: stop storage_service on shutdown
'system.status' and 'system.describe_ring' are imperfect names for
what they do, so rename them. Fortunately they aren't exposed in any
released version so there is no compatibility concern.
Closes#9530
* github.com:scylladb/scylla:
system_keyspace: rename 'system.describe_ring' to 'system.token_ring'
system_keyspace: rename 'system.status' to 'system.cluster_status'
Currently, aws_instance.ephemeral_disks() returns both ephemeral disks
and EBS disks on Nitro System.
This is because both are attached as NVMe disks, so we need to add
disk-type detection code to the NVMe handling logic.
Fixes #9440
Closes #9462
Some time ago we started gathering stats for system tables in a separate
class in order to be able to distinguish which queries come from the
user - e.g. if the unpaged queries are internal or not.
Originally, only local system tables were moved into this class,
i.e. system and system_schema. It would make sense, however, to also
include other internal keyspaces in this separate class - which includes
system_distributed, system_traces, etc.
Fixes #9380
Closes #9490
If mean() is called when there are no elements in the histogram,
a runtime error will happen due to division by zero.
approx_exponential_histogram::mean() handles it, but for some
reason we forgot to do the same for estimated_histogram.
This problem was found when adding a unit test which calls
mean() on an empty histogram.
Fixes#9531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211027142813.56969-1-raphaelsc@scylladb.com>
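The guard can be sketched with a simplified stand-in for estimated_histogram (the bucket layout and the return value for the empty case are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct estimated_histogram {
    std::vector<int64_t> buckets;   // buckets[i] holds the count for value i

    int64_t mean() const {
        int64_t elements = 0;
        int64_t sum = 0;
        for (size_t i = 0; i < buckets.size(); i++) {
            elements += buckets[i];
            sum += buckets[i] * int64_t(i);
        }
        if (elements == 0) {
            return 0;               // empty histogram: avoid division by zero
        }
        return sum / elements;
    }
};
```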
Throw a proper exception from do_load_schemas if parse_statements
fails to parse the schema cql.
Catch it in the scylla-sstable main() function so
it won't be reported as a "seastar - unhandled exception".
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027124032.1787347-1-bhalevy@scylladb.com>
Table names are usually nouns, so SELECT/INSERT statements
sound natural: "SELECT * FROM pets". 'system.describe_ring' defies
this convention. Rename it to 'system.token_ring' so selects are natural.
The name is not in any released version, so we can safely rename it.
'system.status' is too generic, it doesn't explain the status of what.
'system.node_status' is also ambiguous (this node? all nodes?) so I
picked 'system.cluster_status'.
The internal name, nodetool_status_table, was even worse (we're
not querying the status of nodetool!) but fortunately wasn't exposed.
The name is not in any released version, so we can safely rename it.
sprint() is obsolete. Note some calls were to helper functions that
use sprint(), not to sprint() directly, so both the helpers and
the callers were modified.
We have two identical "Truncated frame" errors, at:
* read_frame_size() in serialization_visitors.hh;
* cql_server::connection::read_and_decompress_frame() in
transport/server.cc;
When such an exception is thrown, it is impossible to tell where it was
thrown from, and it doesn't carry any further information (beyond the
basic information its being thrown implies).
This patch solves both problems: it makes the exception messages unique
per location and it adds information about why it was thrown (the
expected vs. real size of the frame).
Ref: #9482
Closes #9520
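A sketch of the idea, assuming a hypothetical `check_frame_size()` helper (not the real Scylla function): each throw site gets a unique message carrying the expected vs. actual size:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// The message names the location ("while reading frame size") and embeds
// both sizes, so the two formerly identical errors become distinguishable.
void check_frame_size(size_t expected, size_t actual) {
    if (actual < expected) {
        throw std::runtime_error(
            "Truncated frame while reading frame size: expected " +
            std::to_string(expected) + " bytes, got " + std::to_string(actual));
    }
}
```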
In definition for /column_family/major_compaction/{name} there is an
incorrect use of "bool" instead of "boolean".
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#9516
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.
Fixes#9524
Test: unit(dev)
DTest: wide_rows_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakyness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.
This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.
There is still one vulnerability left: `row_marker& row_marker()`, which
allows overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.
Fixes: #9483
Tests: unit(dev)
Closes#9519
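A toy model of the symmetric check: timestamps are plain longs, and `deletable_row` with its members is a stand-in, not the real mutation-partition types:

```cpp
#include <algorithm>

// A shadowable tombstone is dropped when the row marker's timestamp is
// newer, and the check must run regardless of which of the two is applied
// first - hence maybe_shadow() is invoked from both apply paths.
struct deletable_row {
    long marker_ts = -1;      // -1 means "no live marker"
    long shadowable_ts = -1;  // -1 means "no shadowable tombstone"

    void maybe_shadow() {
        if (marker_ts > shadowable_ts) {
            shadowable_ts = -1; // marker shadows (erases) the tombstone
        }
    }
    void apply_marker(long ts) {
        marker_ts = std::max(marker_ts, ts);
        maybe_shadow();
    }
    void apply_shadowable(long ts) {
        shadowable_ts = std::max(shadowable_ts, ts);
        maybe_shadow(); // the fix: check here too, not only in apply_marker()
    }
};
```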
Make sure to close the reader created by flush_fragments
if an exception occurs before it's moved to `populate_views`.
Note that it is also ok to close the reader _after_ it has been
moved, in case populate_views itself throws after closing the
reader that was moved to it. For convenience, flat_mutation_reader::close
supports close-after-move.
Fixes#9479
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024164138.1100304-1-bhalevy@scylladb.com>
Allocating the exception might fail, terminating the application as
`abandoned()` is called in a noexcept context. Handle this case by
catching the bad-alloc and aborting the reader with that instead when
this happens.
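The pattern can be sketched as follows; `reader_handle` and `abort_reader()` are simplified stand-ins for the real queue_reader_handle API:

```cpp
#include <exception>
#include <new>
#include <stdexcept>

// abandoned() runs in a noexcept context, and even building the exception
// object may throw std::bad_alloc. Storing an exception_ptr never allocates,
// so abort_reader() can safely be noexcept.
struct reader_handle {
    std::exception_ptr aborted_with;

    void abort_reader(std::exception_ptr ep) noexcept {
        aborted_with = ep;
    }

    void abandoned() noexcept {
        try {
            // Allocating the "abandoned" exception may itself fail...
            abort_reader(std::make_exception_ptr(
                std::runtime_error("abandoned reader_handle")));
        } catch (const std::bad_alloc&) {
            // ...in which case abort the reader with the bad_alloc instead.
            abort_reader(std::current_exception());
        }
    }
};
```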
Functions `upper_bound` and `lower_bound` had signatures:
```
template<typename T, typename... Args>
static rows_iter_type lower_bound(const T& t, Args... args);
```
This caused a decay from `const schema&` to `schema` as one of the args,
which in turn copied the schema in a fair number of the queries. Fix
that by setting the parameter type to `Args&&`, which doesn't discard
the reference.
Fixes #9502
Closes #9507
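The decay can be demonstrated with a copy-counting stand-in `schema`; the two function templates below are illustrative, not the real lower_bound/upper_bound:

```cpp
#include <utility>

// A stand-in type that counts how many times it gets copied.
struct schema {
    static inline int copies = 0;
    schema() = default;
    schema(const schema&) { copies++; }
};

// Old signature: `Args...` deduces `schema` for a `const schema&` argument,
// so the schema is copied on every call.
template <typename... Args>
static void lower_bound_by_value(Args...) {}

// Fixed signature: `Args&&...` deduces `schema&`, preserving the reference.
template <typename... Args>
static void lower_bound_by_ref(Args&&...) {}
```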
* seastar 994b4b5a0c...083898a172 (24):
> Revert "memory: always allocate buf using "malloc" for non reactor"
> Revert dpdk update to 21.08.
> tutorial: Fix typos
> queue: add back template requirement for element type to be nothrow move-constructible
> Revert "queue: require element type to be nothrow move-constructible"
> build: add the closing "-Wl,--no-whole-archive" to the ldflags
> build: add -Wno-error=volatile to CXX_FLAGS
> build: Include dpdk as a single object in libseastar.a
> Merge: queue: cleanup exception handling
> build: drop dpdk-specific machine architecture names
> reactor: call memory::configure() before initialize dpdk
> core/loop: parallel_for_each(): make entire function critical alloc section
> Merge 'scheduling groups: Add compile parameter for setting max scheduling groups count at compile time' from Eliran Sinvani
> test: coroutines_test: assign spinner lambda to local variable
> shared_ptr: mark shared_from_this functions noexcept
> lw_shared_ptr: mark shared_from_this functions noexcept
> build: update download URL for Boost
> Merge "build: build with dpdk v21.08" from Kefu
> cpu_stall_detector: handle wraparounds in Linux perf_event ring buffer
> entry_point.cc: default-initialize sigaction struct
> reactor: s/gettid()/syscall(SYS_gettid)/
> memory: always allocate buf using "malloc" for non reactor
> Revert "memory: always allocate buf using "malloc" for non reactor"
> memory: always allocate buf using "malloc" for non reactor
Currently we update the effective_replication_map
only on non-system keyspaces, leaving the system keyspace,
which uses the local replication strategy, with the empty
replication_map it was first initialized with.
This may lead to a crash when get_ranges is called later
as seen in #9494 where get_ranges was called from the
perform_sstable_upgrade path.
This change updates the effective_replication_map on all
keyspaces rather than just on the non-system ones
and adds a unit test that reproduces #9494 without
the fix and passes with it.
Fixes#9494
Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211020143217.243949-1-bhalevy@scylladb.com>
Fixes#9491
The CQL server, when encountering a "general" exception (i.e. one not
thrown by the CQL error checks), reports a wire error with simply the
what() part of the exception. However, if we have nested exceptions, we
will most likely lose info here (hello encryption).
The general exception case should unwind the nested exceptions and
return the full, concatenated message to avoid confusion.
Closes#9492
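One way to unwind a nested exception chain into a single concatenated message uses std::rethrow_if_nested; this is a sketch of the idea, not necessarily the exact fix:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Recursively append the what() of every nested exception to the message.
std::string full_message(const std::exception& e) {
    std::string msg = e.what();
    try {
        std::rethrow_if_nested(e);
    } catch (const std::exception& inner) {
        msg += ": " + full_message(inner);
    } catch (...) {
        msg += ": unknown nested exception";
    }
    return msg;
}
```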
When an experimental feature graduates from being experimental, we want
to continue to allow the old "--experimental-features=..." option to work,
in case some user's configuration uses it - just do nothing. The way
we do it is to map in db::experimental_features_t::map() the feature's
name to the UNUSED value - this way the feature's name is accepted, but
doesn't change anything.
When the CDC feature graduated from being experimental, a new bit
UNUSED_CDC was introduced to do the same thing. This separate bit was
not actually necessary - if we ever check for UNUSED_CDC bit anywhere
in the code it means the flag isn't actually unused ;-) And we don't
check it.
So simplify the code by conflating UNUSED_CDC into UNUSED.
This will also make it easy to build from db::experimental_features_t::map()
a list of current experimental features - now it will simply be those that
do not map to UNUSED.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013105107.123544-1-nyh@scylladb.com>
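A sketch of the simplified mapping; the enum values and feature names here are illustrative, not the real db::experimental_features_t:

```cpp
#include <map>
#include <string>

// Graduated features all map to the single UNUSED value, so listing the
// *current* experimental features is just "everything that is not UNUSED".
enum class feature { UNUSED, UDF, ALTERNATOR_TTL };

std::map<std::string, feature> feature_map() {
    return {
        {"cdc", feature::UNUSED},       // graduated: accepted, does nothing
        {"udf", feature::UDF},
        {"alternator-ttl", feature::ALTERNATOR_TTL},
    };
}

bool is_experimental(const std::string& name) {
    auto m = feature_map();
    auto it = m.find(name);
    return it != m.end() && it->second != feature::UNUSED;
}
```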
This series fixes a bug which allowed using a secondary index in a restriction for a DELETE statement, which resulted in generating incorrect slices and deleting the whole partition instead. Secondary indexes are not meant to be used for deletes, which this series enforces by marking the indexes as not queriable.
It also comes with a reproducing test case, originally provided by @fee-mendes (thanks!).
Fixes#9495
Tests: unit(release)
Closes#9496
* github.com:scylladb/scylla:
cql-pytest: add reproducer for deleting based on secondary index
cql3: forbid querying indexes for deletions
So the compaction perf of different compaction strategies can be
compared. Data timestamps are diversified such that they fall into four
different buckets if TWCS is used, in order to be able to stress the
timestamp based splitting code path.
Closes#9488
Accessing tm.sorted_tokens().back() causes undefined behavior
if tm.sorted_tokens() is empty.
Check that first and throw/abort using on_internal_error
in this case.
This will prevent the segfault but it doesn't fix the root cause
which is getting here with empty token_metadata. That will be fixed
by the following patch.
Refs #9494
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211019075710.1626808-1-bhalevy@scylladb.com>
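A minimal sketch of the guard, using std::logic_error as a stand-in for on_internal_error and plain longs for tokens:

```cpp
#include <stdexcept>
#include <vector>

// Calling .back() on an empty vector is undefined behavior, so check first
// and fail loudly instead of segfaulting.
long last_token(const std::vector<long>& sorted_tokens) {
    if (sorted_tokens.empty()) {
        throw std::logic_error("last_token called with empty token_metadata");
    }
    return sorted_tokens.back();
}
```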
This commit makes the column names from an invalid delete statement human readable.
Before that, they were printed in their hex representation, which is not convenient
for debugging.
Before:
InvalidRequest:
Error from server: code=2200 [Invalid query]
message="Invalid where clause contains non PRIMARY KEY columns: 76616c"
After:
InvalidRequest:
Error from server: code=2200 [Invalid query]
message="Invalid where clause contains non PRIMARY KEY columns: val"
Message-Id: <52923335e8837295fd5ba2dfd0921196e21f7f16.1634626777.git.sarna@scylladb.com>
This commit adds a test case for a bug reported by Felipe
<felipemendes@scylladb.com>. The bug involves trying to delete
an entry from a partition based on a secondary index created
on a column which is part of the compound clustering key,
and the unfortunate result is that the whole partition gets wiped.
Cassandra's behavior is in this case correct - deletion based
on a secondary index column is not allowed.
Refs #9495
Using secondary indexes for the purpose of a DELETE statement
was never expected to be well-defined, but an edge case in #9495
showed that the index may sometimes be inadvertently used, which
causes the whole partition to be deleted.
In order to prevent such errors, it's now explicitly defined
that an index is not queriable if it's going to be used for the
purpose of a DELETE statement.
The metric currently_open_for_writing, used to report the number of sstables
open for writing, holds the same value as total_open_for_writing. That means
we aren't actually decreasing the counter, so it is bogus.
The metric was moved to sstable_writer, because the sstable object is used by
the writer to open files, which are then extracted from it, and later the
same object is reused in read-only mode.
Fixes#9455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013134812.177398-1-raphaelsc@scylladb.com>
If we're upgrading from an older version with the previous CDC streams
format, we'll upgrade it in the background. Background update is needed
since we need the cluster to be available when performing the upgrade,
but at this point we're just starting a node, and may not succeed in
forming a cluster before we shut down.
However, running in the background is dangerous since the objects we
use may stop existing. The code is careful to use reference counting,
but this does not guarantee that other dependencies are still alive,
especially since not all dependencies are expressed via constructor
parameters.
Fix by waiting for the rewrite work in generation_service::stop(). As
long as generation_service is up, the required dependencies should be
working too.
Note that there is another change here besides limiting the background
work: checks that were previously done in the foreground (limited to
local tables) are now also done in the background. I don't think
this has any impact.
Note: I expect this to have no real impact. Any CDC users will have
long since upgraded. This is just preparing for other patches that
bring in other dependencies, which cannot be passed via reference
counted pointers, so they expose the existing problem.
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.
Fixes#9302.
Closes#9304.
* 'prepare_options' of https://github.com/bhalevy/scylla:
cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy
abstract_replication_strategy: add to_qualified_class_name
TWCS can reshape at most 32 sstables spanning multiple windows, in a
single compaction round. Which sstables are compacted together, when
there are more than 32 sstables, is random.
If sstables with overlapping windows are compacted together, then
write amplification can be reduced because we may be able to push
all the data to a window W in a single compaction round, so we'll
not have to perform another compaction round later in W, to reduce
its number of files. This is also very good to reduce the amount
of transient open file descriptors, because TWCS reshape
first reshapes all sstables spanning multiple windows, so if
all windows temporarily grow large in number of files, then
there's a risk that file descriptors will be exhausted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211013203046.233540-3-raphaelsc@scylladb.com>
After a4053dbb72, data segregation is postponed to offstrategy, so reshape
procedure is called with disjoint sstables which belong to different
windows, so let's extend the optimization for disjoint sstables which
span more than one window. In this way, write amplification is reduced
for offstrategy compaction, as all disjoint sstables will be compacted
at once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013203046.233540-2-raphaelsc@scylladb.com>
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.
Fixes#9302
Test: cql_query_test.test_rf_expand(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it from cql3 check_restricted_replication_strategy and
keyspace_metadata ctor that defined their own `replication_class_strategy`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This mini series contains two fixes that are bundled together since the
second one assumes that the first one exists (or it will not fix
anything really...). The two problems were:
1. When certain operations are called on a service level controller
which doesn't have its data accessor set, it can lead to a crash
since some operations will still try to dereference the accessor
pointer.
2. The CQL environment test initialized the accessor with a
sharded<system_distributed_data>&, however this sharded object itself
was not initialized (sharded::start wasn't called), so the same
operations that were unsafe against a null dereference would now crash
for trying to access an uninitialized sharded instance.
Closes#9468
* github.com:scylladb/scylla:
CQL test environment: Fix bad initialization order
Service Level Controller: Fix possible dereference of a null pointer
Before this patch, if Scylla crashes during some test in test/alternator,
all tests after it will fail because they can't connect to Scylla - and we
can get a report on hundreds of failures without a clear sign of where the
real problem was.
This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing health-check request
after each test. If this health-check request fails, we conclude that
Scylla crashed and report the test in which this happened - and exit
pytest instead of failing a hundred more tests.
The failure report looks something like this:
```
! _pytest.outcomes.Exit: Scylla appears to have crashed in test test_batch.py::test_batch_get_item !
```
And the entire test run fails.
These extra health checks are not free, but they come fairly close to
being free: In my tests I measured less than 0.1 seconds slowdown of
the entire test suite (which has 618 tests) caused by the extra health
checks.
Fixes #9489
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211017123222.217559-1-nyh@scylladb.com>
In preparation for adding more stuff, convert stop() to a coroutine
to avoid an unreadable chain of continuations.
The code uses a finally() block which might not be needed (since .join()
should not fail). Rather than risking getting it wrong I wrapped it in
a try/catch and added logging.
Just like other services, storage_service needs to be stopped on
shutdown. cql_test_env already stops it, so there is some precedent
for it working. I tested a shutdown while cassandra-stress was
running and it worked okay for a few trials.
Issue #9467 deprecated the blanket "--experimental" option which we
used to enable all experimental Scylla features for testing, and
suggests that individual experimental features should be enabled
instead.
So this is what we do in this patch for the Scylla-running scripts
in test/alternator and test/cql-pytest: We need to enable UDF for
the CQL tests, and to enable Alternator Streams and Alternator TTL
for the Alternator tests.
Refs #9467
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211012110312.719654-2-nyh@scylladb.com>
Earlier we added experimental (and very incomplete) support for
Alternator's TTL feature, but forgot to set a *name* for this
experimental feature. As a result, this feature can be enabled only with
the blanket "--experimental" option and not with a specific
"--experimental-features=..." option.
Since issue #9467 deprecated the blanket "--experimental" option
and users are encouraged to only enable specific experimental
features, it is important that we have a name for it.
So the name chosen in this patch is "alternator-ttl".
Eventually this feature might evolve beyond Alternator-only,
but for now, I think it's a good name and we'll probably
graduate the experimental Alternator TTL feature before
supporting CQL, so it will be a new experimental feature
anyway.
Refs #9467.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211012110312.719654-1-nyh@scylladb.com>
The warning was disabled during the migration to clang, but now it
appears unnecessary (perhaps clang added support for the attributes
it did not have then). It is valuable for detecting misspelled
attributes, so enable it again.
Closes #9480
The help string from the "--experimental-features" command-line option
lists the available experimental features, to help a user who might
want to enable them. But this help string was manually written, and has
since drifted from reality:
* Two of the listed "experimental" features, cdc and lwt, have actually
graduated from being experimental long ago. Although technically a user
may still use the words "cdc" and "lwt" in the "experimental-features"
parameter, doing so is pointless, and worse: This text in the help
string can mislead a user into thinking that these two features are
still experimental - while they are not!
* One experimental feature - alternator-ttl - is missing from this list.
Instead of updating the help string text now - and needing to do this
again and again in the future as we change experimental features - what
this patch does is to construct the list of features automatically from
the map of supported feature names - excluding any features which map
to UNUSED.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013122635.132582-1-nyh@scylladb.com>
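The mechanism described above can be sketched roughly like this (the enum, map contents, and function name here are illustrative, not Scylla's actual code):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical feature enum; UNUSED marks graduated features that are still
// accepted on the command line but should not be advertised in the help text.
enum class experimental_feature { UNUSED, UDF, ALTERNATOR_TTL, RAFT };

// Build the help string automatically from the supported-features map
// instead of hand-maintaining it, skipping entries mapped to UNUSED.
std::string available_features_help(const std::map<std::string, experimental_feature>& features) {
    std::string help;
    for (const auto& [name, f] : features) {
        if (f == experimental_feature::UNUSED) {
            continue; // e.g. "cdc", "lwt": still parseable for compatibility, not listed
        }
        if (!help.empty()) {
            help += ", ";
        }
        help += name;
    }
    return help;
}
```

With this, adding or graduating a feature only requires touching the map; the help text can never drift again.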
"
The current api design of abstract_replication_strategy
provides a can_yield parameter to calls that may stall
when traversing the token metadata in O(n^2) and even
in O(n) for a large number of token ranges.
But, to use this option the caller must run in a seastar thread.
It can't be used if the caller runs a coroutine or plain
async tasks.
Rather than keep adding threads (e.g. in storage_service::load_and_stream
or storage_service::describe_ring), the series offers an infrastructure
change: precalculating the token->endpoints map once, using an async task,
and keeping the results in an `effective_replication_map` object.
The latter can be used for efficient and stall-free calls, like
get_natural_endpoints, or get_ranges/get_primary_range, replacing their
equivalents in abstract_replication_strategy, and dropping the public
abstract_replication_strategy::calculate_natural_endpoints and its
internal cached_endpoints map.
Other than the performance benefits of:
1. The current calls require running a thread to be able to yield.
Precalculating the map (using an async task) allows us to use synchronous calls
without stalling the reactor.
2. The replication maps can and should be shared
between keyspaces that use the same replication strategy.
(Will be sent as a follow-up to the series)
The bigger benefits (courtesy of Avi Kivity) are laying the groundwork for:
1. atomic replication metadata - an operation can capture a replication map once, and then use consistent information from the map without worrying that it changes under its feet. We may even be able to s/inet_address/replica_ptr/ later.
2. establish boundaries on the use of replication information - by making a replication map not visible, and observing when its reference count drops to zero, we can tell when the new replication map is fully in use. When we start writing to a new node we'll be able to locate a point in time where all writes that were not aware of the new node were completed (this is the point where we should start streaming).
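The snapshot-capture idea can be sketched minimally like this (types and member names are simplified and hypothetical): an operation captures a shared pointer to an immutable map once, so topology changes under its feet are invisible to it, and observing the old map's reference count tells when the pre-change view is fully out of use.

```cpp
#include <cassert>
#include <memory>

struct effective_replication_map {
    int ring_version;
    // ... the precalculated token -> endpoints mapping would live here ...
};
using erm_ptr = std::shared_ptr<const effective_replication_map>;

struct keyspace {
    erm_ptr _erm = std::make_shared<effective_replication_map>(effective_replication_map{1});

    // Each operation captures a stable snapshot once and uses it throughout.
    erm_ptr get_effective_replication_map() const { return _erm; }

    // A topology change publishes a brand-new immutable map; in-flight
    // operations keep the old one alive via their shared_ptr.
    void publish(int new_version) {
        _erm = std::make_shared<effective_replication_map>(effective_replication_map{new_version});
    }
};
```

When the last holder of the old map drops its reference, all operations unaware of the new topology are known to have completed.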
Notes:
* The get_natural_endpoints method that uses the effective_replication_map
is still provided as an abstract_replication_strategy virtual method
so that local_strategy can override it and provide natural endpoints
for any search token, even in the absence of token_metadata, when
called early on, before token_metadata has been established.
The effective_replication_map materializes the replication strategy
over given replication strategy options and token_metadata.
Whenever either of those changes for a keyspace, we make a new
effective_replication_map and keep it in the keyspace for later use.
Methods that depend on an ad-hoc token_metadata (e.g. during
node operations like bootstrap or replace) are still provided
by abstract_replication_strategy.
TODO:
- effective_replication_map registry
- Move pending ranges from token_metadata to replication map
- get rid of abstract_replication_strategy::get_range_addresses(token_metadata&)
- calculate replication map and use it instead.
Test: unit(dev, debug)
Dtest: next-gating, bootstrap_test.py update_cluster_layout_tests.py alternator_tests.py -a 'dtest-full,!dtest-heavy' (release)
"
* tag 'effective_replication_strategy-v6' of github.com:bhalevy/scylla: (44 commits)
effective_replication_map: add get_range_addresses
abstract_replication_strategy: get rid of shared_token_metadata member and ctor param
abstract_replication_strategy: recognized_options: pass const topology&
abstract_replication_strategy: precalculate get_replication_factor for effective_replication_map
token_metadata: get rid of now-unused sync methods
abstract_replication_strategy: get rid of do_calculate_natural_endpoints
abstract_replication_strategy: futurize get_*address_ranges
abstract_replication_strategy: futurize get_range_addresses
abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr)
abstract_replication_strategy: move get_ranges and get_primary_ranges* to effective_replication_map
compaction_manager: pass owned_ranges via cleanup/upgrade options
abstract_replication_strategy: get rid of cached_endpoints
all replication strategies: get rid of do_get_natural_endpoints
storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints
abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map
storage_service: bootstrap: add log messages
storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings
shared_token_metadata: set: check version monotonicity
token_metadata: use static ring version
token_metadata: get rid of copy constructor and assignment operator
...
Make a reader that reads from memtable in reverse order.
This draft PR includes two commits, out of which only the second is
relevant for review.
Described in #9133.
Refs #1413.
Closes #9174
* github.com:scylladb/scylla:
partition_snapshot_reader: pop_range_tombstone returns reference (instead of value) when possible.
memtable: enable native reversing
partition_snapshot_reader: reverse ck_range when needed by Reversing
memtable, partition_snapshot_reader: read from partition in reverse
partition_snapshot_reader: rows_position and rows_iter_type supporting reverse iteration
partition_snapshot_reader: split responsibility of ck_range
partition_snapshot_reader: separate _schema into _query_schema and _partition_schema
query: reverse clustering_range
test: cql_query_test: fix test_query_limit for reversed queries
Our scylla.yaml contains a comment listing the available experimental
features, supposedly helping a user who might want to enable them.
I think the usefulness of this comment is dubious, but as long as we
have one, let's at least make it accurate:
* Two of the listed "experimental" features, cdc and lwt, have actually
graduated from being experimental long ago. Although technically a user
may still use the words "cdc" and "lwt" in the "experimental-features"
list, doing so is pointless, and worse: This comment suggests that these
two features are still experimental - while they are not!
* One experimental feature - alternator-ttl - is missing from this list.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013083247.13223-1-nyh@scylladb.com>
Equivalent to abstract_replication_strategy get_range_addresses,
yet synchronous, as it uses the precalculated map.
Call it from storage_service::get_new_source_ranges
and range_streamer::get_all_ranges_with_sources_for.
Consequently, get_new_source_ranges and removenode_add_ranges
can become synchronous too.
Unfortunately we can't entirely get rid of
abstract_replication_strategy::get_range_addresses
as it's still needed by
range_streamer::get_all_ranges_with_strict_sources_for.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is not used any more.
Methods either use the token_metadata_ptr in the
effective_replication_map, or receive an ad-hoc
token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for deleting the _shared_token_metadata member.
All we need for recognized_options is the topology
(for network_topology_strategy).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that abstract_replication_strategy methods are all async,
clone_only_token_map_sync and update_normal_tokens_sync
are unused.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is no longer in use.
And with it, the virtual calculate_natural_endpoint_sync method
of which it was the only caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Remaining callers of get_address_ranges and get_pending_address_ranges
are all either from a seastar thread or from a coroutine
so we can make the methods always async and drop the
can_yield param.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All remaining use sites are called in a seastar thread
so we drop the can_yield param and make get_range_addresses
always async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is called only from repair, in a thread,
so it can be made always async and the need_preempt param
can be dropped.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide a sync get_ranges method by effective_replication_map
that uses the precalculated map to get all token ranges owned by or
replicated on a given endpoint.
Reuse do_get_ranges as common infrastructure for all
3 cases: get_ranges, get_primary_ranges, and get_primary_ranges_within_dc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So they can be easily computed using an async task
before constructing the compaction object
in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that all flavors of get_natural_endpoints methods
were moved to effective_replication_map,
do_get_natural_endpoints and its overrides are unused.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We should invalidate the cached rings every time the
token metadata changes, not only on topology changes,
so that cached token/replication mappings are invalidated
when the modified token_metadata is committed.
Currently we can do without it (apparently)
but this will become a requirement for keeping
versions of the effective_replication_map
in a registry, indexed by the token_metadata ring version,
among other things.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Setting the ring version backwards means it got out of sync.
Possibly concurrent updates weren't serialized properly
using token_metadata_lock / mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For generating unique _ring_version.
Currently when we clone a mutable token_metadata_ptr
it remains with the same _ring_version
and the ring version is updated only when the topology changes.
To be able to distinguish these transient copies
from the ones that got applied, be stricter about
the ring version and change it to a unique number
using a static counter.
Next patch will update the ring version
(and consequently invalidate the cached_endpoints
on the replication strategy) every time the token_metadata
changes, not only when the topology changes.
Note that the _cached_endpoints will go away
once the transition to effective_replication_map
is finished, so this will not degrade performance.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
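The two version-related changes above (unique ring versions from a static counter, and the monotonicity check when committing) can be sketched together like this; the member names are illustrative, and the real token_metadata carries far more state:

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

struct token_metadata {
    // Static counter generating unique ring versions (C++17 inline variable).
    static inline std::atomic<long> s_ring_version{0};
    // Every newly constructed instance draws its own unique version, so
    // transient clones are distinguishable from the committed copy.
    long ring_version = ++s_ring_version;

    token_metadata clone() const { return token_metadata{}; } // fresh version
};

struct shared_token_metadata {
    long _committed_version = 0;
    void set(const token_metadata& tm) {
        // Setting the ring version backwards means concurrent updates were
        // not serialized via token_metadata_lock / mutate_token_metadata.
        if (tm.ring_version <= _committed_version) {
            throw std::runtime_error("ring version went backwards");
        }
        _committed_version = tm.ring_version;
    }
};
```

The real code checks monotonicity on `shared_token_metadata::set`; here a thrown exception stands in for whatever failure reporting is used.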
Now that all users of it were converted to use the
effective_replication_map, the legacy
abstract_replication_strategy::get_natural_endpoints method
can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we call find_keyspace and then
get_effective_replication_map on the _same_ keyspace
to get_natural_endpoints for multiple tokens.
Get the effective_replication_map once in these cases
and use it for each token.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Every time the token_metadata changes we need to update the
effective_replication_map on all non-system keyspaces.
Do that in replicate_to_all_cores after the updated token_metadata
has been replicated to all cores.
We first prepare and clone the token_metadata, then prepare
and clone the new effective_replication_maps. Any failure
at this stage is recoverable: it is handled via rollback and the
exception is returned.
Note that any failure to _apply_ the pending token_metadata or the
effective_replication_map will cause scylla to abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Serialize the metadata changes with
keyspace create, update, or drop.
This will become necessary in the following patch
when we update the effective_replication_map
on all keyspaces and want instances on all shards
to end up with the same replication map.
Note that storage_service::keyspace_changed is called
from the schema merge path so it already holds
the merge_lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than a _pending_token_metadata_ptr member in the storage_service
class. This is much easier now that the function was converted to a
coroutine.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And functions that use it, like:
keyspace::update_from
database::update_keyspace
database::create_in_memory_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
effective_replication_map holds the full replication_map
resulting from applying the effective replication strategy
over the given token_metadata and replication_strategy_config_options.
It is calculated once, in make_effective_replication_map(), and then it
can be used for retrieving the endpoints/token_ranges synchronously
from the precalculated map.
A new virtual get_natural_endpoints(const token&, const effective_replication_map&)
method has been added to abstract_replication_strategy so that
local_strategy and everywhere_replication_strategy can override it as they may be
needed before the token_metadata is established.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Enable creating shared_ptr<BaseClass> in nonstatic_class_registry
using BaseClass::ptr_type and use that for
abstract_replication_strategy.
While at it, also clean up compressor in that respect,
defining compressor::ptr_type as shared_ptr<compressor>
and thus simplifying compressor_registry.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
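The ptr_type idea can be sketched like this (a hypothetical, heavily simplified registry; the real nonstatic_class_registry takes constructor arguments and more): the base class declares the handle type it wants handed out, and the registry returns exactly that.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// The base class declares the handle type the registry should produce.
struct compressor {
    using ptr_type = std::shared_ptr<compressor>;
    virtual ~compressor() = default;
    virtual std::string name() const = 0;
};

struct lz4_compressor : compressor {
    std::string name() const override { return "LZ4Compressor"; }
};

// A registry keyed by class name that creates BaseClass::ptr_type,
// instead of hard-coding a particular smart-pointer type.
template <typename BaseClass>
class class_registry {
    std::map<std::string, std::function<typename BaseClass::ptr_type()>> _factories;
public:
    void register_class(std::string name, std::function<typename BaseClass::ptr_type()> f) {
        _factories.emplace(std::move(name), std::move(f));
    }
    typename BaseClass::ptr_type create(const std::string& name) const {
        return _factories.at(name)();
    }
};
```

Because `compressor::ptr_type` is `shared_ptr<compressor>`, the same registry template serves both replication strategies and compressors without per-class pointer plumbing.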
Besides the replacement being more modern and more efficient for
the compiler, the old #ifndef block confuses my editor,
which greys out the whole block.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And with that rename calculate_natural_endpoints(const token& search_token, const token_metadata&, can_yield)
to do_calculate_natural_endpoints and make it protected.
With this patch, all its external users call the async version, so
rename it back to calculate_natural_endpoints, and make
calculate_natural_endpoints_sync private since it's being called
only within abstract_replication_strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
calculate_natural_endpoints_sync and _async are both provided
temporarily until all users of them are converted to use
the async version which will remain.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extract natural_endpoints_tracker out of calculate_natural_endpoints
so we can easily split the function into sync and async variants.
Test: network_topology_strategy_test(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The service level controller was initialized with a data
accessor that uses the system distributed keyspace before
the latter has been initialized. If this accessor is used
(for example by calling
service_level_controller::get_distributed_service_levels())
it will fail miserably and crash.
Not initializing the data accessor doesn't mean the same thing,
since we can deal with such a call when the accessor is not
initialized.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
If the service level controller doesn't have its data accessor set,
calls that fetch distributed information might dereference this
unset accessor pointer. Here we add code that will return
a result as if there is no data available to the accessor (a behaviour
which is roughly equivalent to a null data accessor).
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
The documentation for --experimental config option states
that it enables all experimental features, but this is no
longer true, i.e.: raft feature is not enabled with it and
should be explicitly enabled via `--experimental-features=raft`
switch (we don't want to enable it by default alongside
other features).
Since the flag doesn't do what it's intended to, we should
mark it as "deprecated", because documenting each exception
(there could be more than only raft in the future) will be
a burden and docs will constantly go out-of-sync with the
code.
Adjust the description for the option to reflect that, mark
it "deprecated" and suggest using --experimental-features, instead.
Fixes: #9467
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211012093005.20871-2-pa.solodovnikov@scylladb.com>
Currently it is required that sstables (in particular la/mx ones) are located at a valid path. This is required because `sstables::entry_descriptor::make_descriptor()` extracts the keyspace and table names from the sstable dir components. This PR relaxes this by using a newly introduced `sstables::entry_descriptor::make_descriptor()` overload which allows the caller to specify keyspace and table names, not necessitating these to be extracted from the path.
Tests: unit(dev), manual(testing that `scylla-sstables` can indeed load sstables from invalid path)
Closes #9466
* github.com:scylladb/scylla:
tools/scylla-sstable: allow loading sstables from any path
sstables: entry_descriptor::make_descriptor(): add overload with provided ks/cf
Currently, we configure LimitNOFILE on scylla-server.service, but we
don't configure fs.nr_open and fs.file-max.
When fs.nr_open or fs.file-max are smaller than LimitNOFILE, we may fail
to allocate FDs.
To fix this issue, raise fs.file-max and fs.nr_open to sizes large
enough for scylla.
Fixes #9461
Closes #9461
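The relationship between the limits can be captured in a sysctl drop-in like the following (the values here are illustrative, not necessarily the ones the patch ships):

```
# /etc/sysctl.d/99-scylla.conf -- illustrative values
# fs.file-max (system-wide FD cap) and fs.nr_open (per-process ceiling)
# must both be >= LimitNOFILE in scylla-server.service, otherwise FD
# allocation can fail even though RLIMIT_NOFILE was raised.
fs.file-max = 10485760
fs.nr_open = 10485760
```

systemd clamps LimitNOFILE to fs.nr_open, and the kernel refuses new FDs past fs.file-max, which is why both must be raised together with the service limit.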
Scrub compaction in segregate mode can split the input sstable into as
many as hundreds or even thousands of output sstables in the extreme
case. But even at a few dozen output sstables, most of these will only
have a few partitions with a few rows. These sstables however will still
have their bloom filter allocated according to the original
partition-count estimate, causing memory bloat or even OOM in the
extreme case.
This patch solves this by aggressively adjusting the partition count
downwards after the second bucket has been created. Each subsequent
bucket will halve the partition estimate, which will quickly reach 1.
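The adjustment can be sketched as follows (the function and variable names are illustrative, not the actual compaction code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// For each output bucket of a segregating scrub, return the partition-count
// estimate used to size its bloom filter: the first two buckets keep the
// original estimate, then every subsequent bucket halves it, so the
// estimate quickly reaches 1 and the filters stay small.
std::vector<uint64_t> bucket_partition_estimates(uint64_t original_estimate, unsigned buckets) {
    std::vector<uint64_t> estimates;
    uint64_t current = original_estimate;
    for (unsigned i = 0; i < buckets; ++i) {
        estimates.push_back(current);
        if (i >= 1) { // start adjusting after the second bucket has been created
            current = std::max<uint64_t>(current / 2, 1);
        }
    }
    return estimates;
}
```

Even with hundreds of output sstables, the total bloom-filter memory is then bounded by roughly twice the first bucket's allocation rather than growing linearly with the bucket count.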
Fixes: #9463
Closes #9464
Currently it is required that sstables (in particular la/mx ones) are
located at a valid path. This is required because
`sstables::entry_descriptor::make_descriptor()` extracts the keyspace
and table names from the sstable dir components.
This patch relaxes this by using the freshly introduced
`sstables::entry_descriptor::make_descriptor()` overload which allows
the caller to specify keyspace and table names, so they no longer need
to be extracted from the sstable dir path. This
practically allows for la/mx sstables at non-standard paths to be
opened. This will be used by the `scylla-sstable` tool which wants to be
flexible about where the sstables it opens are located.
Issue #8203 describes a bug in a long scan which returns a lot of empty
pages (e.g., because most of the results are filtered out). We have two
cql-pytest test cases that reproduced this bug - one for a whole-table
scan and one for a single-partition scan.
It turned out that the bug was not in the Scylla server, but actually in
the Python driver which incorrectly stopped the iteration after an empty
page even though this page did contain the "more pages" flag.
This driver bug was already fixed in the Datastax driver (see
6ed53d9f70,
and in the Scylla fork of the driver:
1d9077d3f4
So in this patch we drop the XFAIL, and if the driver is not new enough
to contain this fix - the test is skipped.
Since our Jenkins machines have the latest Scylla fork of the driver and
it already contains this fix, these tests will not be skipped - and will
run and should pass. Developers who run these tests on their development
machine will see these tests either passing or skipped - depending on
which version of the driver they have installed.
Closes #8203
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211011113848.698935-1-nyh@scylladb.com>
Our source base drifted away from gcc compatibility; this mostly
restores the ability to build with gcc. An important exception is
coroutines that have an initializer list [1]; this still doesn't work.
We aim to switch back to gcc 11 if/when this gives us better
C++ compatibility and performance.
Test: unit (dev)
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056
Closes #9459
* github.com:scylladb/scylla:
test: radix_tree_printer: avoid template specialization in class context
test: raft: avoid ignored variable errors
test: reader_concurrency_semaphore_test: isolate from namespace of source_location
test: cql_query_test: drop unused lambda assert_replication_not_contains
test: commitlog_test: don't use deprecated seastar::unaligned_cast
test: adjust signed/unsigned comparisons in loops and boost tests
build: silence some gcc 11 warnings
sstables: processing_result_generator: make coroutine support palatable for C++20 compilers
managed_bytes: avoid compile-time loop in converting constructor
service: service_level_controller: drop unused variable sl_compare
raft: disambiguate promise name in raft::active_read
locator: azure_snitch: use full type name in definition of globals
cql3: statements: create_service_level_statement: don't ignore replace_defaults()
cql3: statement_restrictions: adjust call to std::vector deduction guide
types: remove recursive constraint in deserialize_value
cql3: restrictions: relax constraint on visitor_with_binary_operator_content
treewide: handle switch statements that return
cql3: expr: correct type of captured map value_type
cdc: adjust type of streams_count
alternator: disambiguate attrs_to_get in table_requests
Merged patch series from Pavel Emelyanov:
There's a long-term (well, likely mid-term already) goal to keep a
single role for the storage_service, namely -- managing the state of
a node in the ring. Then rename it once it happens to stop people
from loading new stuff into storage_service. There are at least three
REST API endpoints that stand in the way.
1. load_new_ss_tables. This part is moved to a new sharded sstables
loader that wraps existing distributed_loader
2. view_build_statuses. Statuses are maintained by view_builder so must
be retrieved from the same place
3. enable_|disable_auto_compaction. This is purely a database knob,
which it used to be some time ago
This change also removes view_update_generator from storage_service list
of dependencies and leaves the system_distributed_keyspace as the
start-only one (another not-yet-published branch makes use of it and
removes s.d.ks from storage service entirely).
branch: https://github.com/xemul/scylla/tree/br-unload-storage-service-api-3
tests: unit(dev)
refs: #5489
* 'br-unload-storage-service-api-3' of github.com:xemul/scylla:
storage_service, api: Move set-tables-autocompaction back into API
api: Fix indentation after previous patch
api, database, storage_service: Unify auto-compaction toggle
api: Remove storage service from new APIs
view_builder: Accept view_build_statuses
storage_service: Move view_build_statuses code
api, storage_service: Keep view builder API handlers separate
storage_service: Remove view update generator from
sstables_loader: Accept the sstables loading code
storage_service: Move the sstables loading code
storage_service, api: Keep sstables loading API handlers separate
sstables_loader: Introduce
distributed_loader, utils: Move verify_owner_and_mode
distributed_loader: Fix methods visibility
`partition_reversing_data_source` uses `continuous_data_consumer`s
internally (`partition_header_context`, `row_body_skipping_context`)
which hold `input_stream`s opened to sstable data files. These
`input_stream`s must be closed before destruction. Right now they would
sometimes cause "Assertion `_reads_in_progress == 0' failed" on
destruction.
Close the `continuous_data_consumer`s before they are destroyed so they
can close their `input_stream`s.
Fixes#9444.
Closes #9451
The global autocompaction toggle is no longer tied to the storage
service. It naturally belongs to the database, but is small and
tidy enough not to pollute database methods and can be placed into
the api/ dir itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two knobs here -- global and per-table one. Both were added
without any synchronisation, but the former one was later fixed to
become serialized and not to be available "too early".
This patch unifies both toggles to be serialized with each-other and
not be enabled too early.
The justification for this change is to move the global toggle from out
of the storage service, as it really belongs to the database, not the
storage service. Respectively, the current synchronization, that depends
on storage service internals, should be replaced with something else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The APIs that had been recently switched to using relevant services no
longer need the storage service reference capture, so remove it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The code itself is already in the relevant .cc file, now move it to the
relevant class.
The only significant change is where to get token metadata from.
In its old location tokens were provided by the storage service
itself, now when it's in the view builder there's no "native" place
to get them from, however the rest of the view building code gets
tokens from global storage proxy, so do the same here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code belongs to view builder, so put it into its .cc. No changes,
just move. This needs some ugly namespace breakage, but they will
be patched away with the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's the 'storage_service/view_build_statuses' endpoint. Its
handler code sits in the storage_service, but the functionality
belongs purely to view_builder. Same as with the sstables loader,
detach the endpoint's API set/unset code; next patches will fix
the handler to use view_builder.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The code was moved in the relevant .cc file by previous patch, now
make it sit in the relevant class. One "significant" change is that
the messaging service is available by local reference already, not
by the sharded one. Other dependencies are already satisfied by the
patch that introduced the sstables_loader class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just cut-n-paste the code into sstables_loader.cc. No other
changes but replace storage service logger with its own one.
For now the code stays in storage_service class, but next
patch will relocate the code into the sstables_loader one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the handlers sit in one boat with the rest of the storage
service APIs. Next patches will switch this particular endpoint to
use the previously introduced sstables_loader; before doing so, here are
the respective API set/unset stubs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a sharded service that will be responsible for loading
sstables via the respective REST API (the endpoint in question
is in turn handling the nodetool refresh command). This patch
adds the loader, equips it with the needed dependencies and
starts/stops one from main. Next patches will move the loader
code from storage_service into this new one. The list of
dependencies that are introduced in this patch is exactly
what's needed by the mentioned code move.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method sits in dist.loader, but really belongs to util/ as it
just works on an "abstract" path and doesn't need to know what this
path is about. Another sign of the layering violation is the inclusion
of dist.loader code into util/ stuff.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the methods are marked public, but only a few of them should be.
The test needs a bit more, however, so distributed_loader_for_tests
is declared as a friend class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit consists of changes that need to reside in a single
commit so that the tests pass on each commit.
1. Remove do_make_flat_reader which disabled reverse reads by making the
slice a forward one. Remove call to get_ranges which would do
superfluous reversal of clustering ranges.
2. test: cql_query_test: remove expectation that the test_query_limit
fails for reversed queries, since reversed queries no longer require
linear memory wrt. the result size, when paginated.
Previous commits made it possible to split the responsibility of two
kinds of clustering key ranges in read_next and next_range_tombstone.
Here, the actual reversal takes place and we start passing the actually
reversed ck_range, if Reversing. This reversed ck_range is stored as a
class member, so that the reversal happens just once for each range.
In this commit, I add the ability to read from partition snapshots in
reverse order. Before these changes, a reverse read from memtable has
been handled as follows:
- A reader higher in the hierarchy of readers performs a read from
memtable in the forward order, which is not aware of the intention to
read in reverse.
- Later, some reader reverses the received mutation fragments.
Memtable decides, based on options in `slice`, whether to read forward
or in reverse. Note that the previous commit creates a killswitch which
clears the `reverse` option from the slice before running the
reverse-or-not logic. This is because this commit doesn't contain all
the required code changes yet.
The reversing partition snapshot reader maintains two schemas - one that
is the reversed schema (called _query_schema) for the output, and the
other one (forward one, called _snapshot_schema), which is used to
access the memtable tree (which needs to be the same as the schema used
to create memtable).
The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not.
gcc complains about comparing a signed loop induction variable
with an unsigned limit, or a signed expected value with an unsigned
measured value. Fix by using unsigned throughout, except in one case
where the signed value was needed for the data_value constructor.
clang implements the coroutine technical specification, in namespace
std::experimental. gcc implements C++20 coroutines, in namespace std.
Detect which one is in use and select the namespace accordingly.
managed_bytes_basic_view is a template with a constructor that
converts from one instantiation of the template to another.
Unfortunately when gcc encounters the associated constraint, it
instantiates the template which forces it to evaluate the constraint
again, sending it into a loop.
Fix that by making the converting constructor a template itself,
delaying instantiation. The constraint is strengthened so the set
of types on which the constructor works is unchanged.
Some globals in azure_snitch use std::string in the declaration
and auto in the definition. gcc 11 complains. I don't know if it's
correct, but it's easy to use the type in both declaration and
definition.
gcc 11 has a hard time parsing a deduction guide use with
braced initializer. The bug [1] was already fixed in gcc 12,
and I've requested a backport, but reduce friction meanwhile
by switching to a form that works in gcc 11.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89062
deserialize_value() has a constraint that depends on another
deserialize_value() implementation. Apparently gcc wants to
instantiate the deserialize_value() instance we're constraining
while evaluating the constraint, leading to a loop.
Since this deserialize_value() is just an internal helper, drop
the constraint rather than fighting it.
We require that v.current_binary_operator is a 'const binary_operator*',
but it's really a 'const binary_operator*&'. Relax the constraint so it
works with both gcc and clang.
A switch statement where every case returns triggers a gcc
warning if the surrounding function doesn't return/abort.
Fix by adding an abort(). The abort() will never trigger since we
have a warning on unhandled switch cases.
Previously, next_range_tombstone took as an argument a clustering key
range, which served two purposes. One was for accessing only specified
key ranges from the partition, the other was for deciding in which order
the mutation fragments should be emitted. This commit separates these
responsibilities, since with the advent of the native memtable reader,
these two responsibilities are no longer common. The split is propagated to
the rest of the partition_snapshot_reader.hh to avoid confusion.
After memtable starts supporting reverse order queries,
the schema provided to the readers will be reversed (reverse clustering
order). Reading from memtable in reverse requires two schemas - one to
access the memtable internal data structures (_partition_schema), and
the other one (_query_schema), the schema imposing clustering order on
returned mutation fragments. This commit prepares for introduction of
native reverse queries for memtable, by separating these responsibilities.
For now, they are still initialized with the schema passed from query.
sstables_manager supersedes the previous implementation of sstables_tracker
for tracking lifetime of the tables. Update scylla-gdb.py to use
sstables_manager in a backwards compatible way, as sstables_manager is
not available in Scylla Enterprise 2020.1. Add explicit test for
"scylla sstables" command, as previously only "scylla active-sstables"
was tested.
Closes #9439
pytest supports - if the "repeat" extension is installed - a convenient
and efficient way to repeat the same test (or all of them) multiple times.
Since it's very useful, let's document it in cql-pytest/README.md.
By the way, our test.py also has a "--repeat" option, but it can only run
all cql-pytest tests, not just repeat a single small test, and it is also
slower (and arguably, different) because it restarts Scylla instead of
running a test 100 times on the same Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211007122146.624210-1-nyh@scylladb.com>
The test generates random mutations and eliminates mutations whose keys
tokenize to 0, in particular it eliminates mutations with empty
partition keys (which should not end up in sstables).
However it would do that after using the randomly generated mutations to
create their reversed versions. So the reversed versions of mutations
with empty partition keys would stay.
Fix by placing the workaround earlier in the test.
Closes #9447
rewrite_sstables() can be asked to stop either on shutdown or on a
user-triggered compaction which forces all ongoing compactions to stop,
like scrub.
It turns out we weren't actually bailing out from do_until() when the
task cannot proceed. So rewrite_sstables() potentially runs into an
infinite loop, which in turn causes shutdown or anything else waiting
on it to hang forever.
Found this while auditing the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005233601.155442-1-raphaelsc@scylladb.com>
I stumbled upon this failure in dev mode:
```
test/boost/sstable_conforms_to_mutation_source_test.cc(0): Entering test case "test_sstable_reversing_reader_random_schema"
sstable_conforms_to_mutation_source_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
Aborting on shard 0.
```
Since dev mode has no debug symbols I can't
decode the stack trace so I'm not 100% sure about
the root cause and I couldn't reproduce it in release
or debug modes yet.
One vulnerability in the current code is that r1
won't be closed if an exception is thrown before r1 and r2
are moved to `compare_readers` so this change adds
a deferred close of r1 in this case.
Test: sstable_conforms_to_mutation_source_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211006144009.696412-1-bhalevy@scylladb.com>
A map's value_type has a const key, but in two places we omitted
the const. This causes construction of a new value, plus gcc
complaining that we're referring to a temporary.
Fix by using the correct type.
streams_count has signed type, but it's compared against an unsigned
type, annoying gcc. Since a count should be positive, convert it to
an unsigned type.
There is a table_requests::attrs_to_get type, and also a type
named attrs_to_get used in the same struct, and gcc doesn't
like this. Disambiguate the type by fully qualifying it.
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectOrderByTest.java into our cql-pytest
framework.
This test file includes 17 tests for various features and corners of
SELECT's "ORDER BY" feature. All these tests pass on Cassandra, but
three fail on Scylla and are marked as xfail:
One previously-unknown Scylla bug:
Refs #9435: SELECT with IN, ORDER BY and function call does not obey
the ORDER BY
And two new reproducers for already known bugs:
Refs #2247: ORDER BY should allow skipping equality-restricted
clustering columns
Refs #7751: Allow selecting map values and set elements, like in
Cassandra 4.0
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005174140.571056-1-nyh@scylladb.com>
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:
```
verb [[attr1, attr2...]] my_verb (args...) -> return_type;
```
`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.
For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.
These are the methods being created for each RPC verb:
```
static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
static future<> unregister_my_verb(netw::messaging_service* ms);
static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
```
Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.
There is also a method to unregister all verbs at once:
```
static future<> unregister(netw::messaging_service* ms);
```
The following attributes are supported when declaring an RPC verb
in the IDL:
* `[[with_client_info]]` - the handler will contain a const reference to
an `rpc::client_info` as the first argument.
* `[[with_timeout]]` - an additional `time_point` parameter is supplied
to the handler function and `send*` method uses `send_message_*_timeout`
variant of internal function to actually send the message.
* `[[one_way]]` - the handler function is annotated by
`future<rpc::no_wait_type>` return type to designate that a client
doesn't need to wait for an answer.
The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of return clause is prohibited and the
signature of `send*` function always returns `future<>`.
No existing code is affected.
Ref: #1456
Closes #9359
* github.com:scylladb/scylla:
idl: support generating boilerplate code for RPC verbs
idl: allow specifying multiple attributes in the grammar
message: messaging_service: extract RPC protocol details and helpers into a separate header
Let's major compact the smallest tables first, increasing chances of
success if low on disk space.
parallel_for_each() didn't have any effect on space requirement as
compaction_manager serializes major compaction in a shard.
As parallel_for_each() is no longer used, find_column_family() is now
used before each compact_all_sstables() to avoid a race with table drop.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005135257.31931-1-raphaelsc@scylladb.com>
Ever since we started testing Alternator with tests written in Python
and using Amazon's "boto3" library, one limitation kept annoying us:
Boto3 verifies the validity of the request parameters before passing
them on to the server. It verifies that mandatory parameters are not
missing, that parameters have the right types, and sometimes even the
right ranges - all in the library before ever sending the request.
This meant that in many cases, we couldn't get good test coverage for
Alternator's server-side handling of *wrong* parameters.
As it turns out, it is trivial to tell boto3 to *not* do its client-side
request validation, with the `parameter_validation=False` config flag.
We just never noticed that such a flag existed :-)
So this patch adds this flag. It then fixes a few tests which expected
ParameterValidationError - this error is the client-side validation
failure, but should now be replaced by checking the server-side error.
The patch also adds a couple of invalid parameter checks that we
couldn't do before because of boto3's eagerness to check them on the
client side. We can add a lot more of these error tests in the future,
now that we got rid of client-side validation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005095514.537226-1-nyh@scylladb.com>
This patch adds yet another test for Alternator's unimplemented feature
of adding a GSI to an already existing table (issue #5022), but this
test is for a very specific corner case - tables which contain string
attributes with an empty value - the corner case described in
issue #9424:
DynamoDB used to forbid any string attributes from being set to an empty
string, but this changed in May 2020, and since then empty strings are
allowed - but NOT as keys. So although it is legal to set a string
attribute to an empty string, if this table has a GSI whose key is that
specific attribute, the update command is refused. We already had a
test for this - test_gsi_empty_value.
However, the case in this patch is the case where a GSI is added to a
table *after* the table already has data. In this case (as this test
demonstrates), we are supposed to drop the items which have the empty
string key from the GSI.
Even when #5022 (the ability to add GSIs to existing tables) will be done,
this test will continue to fail. The unique problem of this test is that
Scylla's materialized views *do* allow empty strings as clustering keys
(right now) and even partition keys (after #9375 will be solved), while
we don't want them to enter the GSI. We will probably need to add to the
view's filter, which right now contains (as required) "x IS NOT NULL"
also the filter "x != ''" (when x's type is a string or binary) so that
items with empty-string keys will be dropped.
Refs #5022
Refs #9375
Refs #9424
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003170636.477582-1-nyh@scylladb.com>
DynamoDB has a rather baroque definition of numbers, and in particular
it does *not* allow numeric attributes to be set to infinity or NaN.
Although I did check invalid numbers in the past, manually, I was never
able to write a unit test for this in the past - because the boto3
library catches such errors on the client side, and prevents the test from
sending broken requests to the server. So in this patch, I finally came up
with a solution - a context manager client_no_transform() which
yields a client which does NOT do any transformation or validation
on the request's parameters, allowing us to use boto3 to create
improper requests - and test the server's handling of them.
The test in this patch passes - it did not discover a new bug, but
it is a useful regression test and the client_no_transform() trick
can be used in more error-case tests which until now we were unable
to write.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211004161809.520236-1-nyh@scylladb.com>
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.
Includes that were lost due to the change are restored in individual
translation units.
Closes #9434
base64.hh pulls in the huge rjson.hh, so if someone just wants
a base64 codec they have to pull in the entire rapidjson library.
Move the json related parts of base64.hh to rjson.hh and adjust
includes and namespaces.
In practice it doesn't make much difference, as all users of base64
appear to want json too. But it's cleaner not to mix the two.
Closes #9433
The bouncing algorithm tries to send a request to other servers in the
configuration after it receives a `not_a_leader` response.
Improve the algorithm so it doesn't try the same server twice.
As a preparation for the following commits.
Otherwise the definitions wouldn't be found during argument-dependent lookup
(I don't understand why it worked before but won't after the next commit).
The operation sends a reconfiguration request to a Raft cluster. It
bounces a few times in case of `not_a_leader` results.
A side effect of the operation is modifying a `known` set of nodes which
the operation's state has a reference to. This `known` set can then be
used by other operations (such as `raft_call`s) to find the current
leader.
For now we assume that reconfigurations are performed sequentially. If a
reconfiguration succeeds, we change `known` to the new configuration. If
it fails, we change `known` to be the set sum of the previous
configuration and the current configuration (because we don't know what
the configuration will eventually be - the old or the attempted one - so
any member of the set sum may eventually become a leader).
"
There's a run_mutation_source_tests lib helper that runs a bunch of
tests sequentially. The problem is that it does 4 different flavors
of it, each being a certain decoration over the provided reader. This
amplification makes some test cases run for an enormous amount of time
without any chance for parallelism.
The simplest way to help run those cases in parallel is to teach
the slowest cases to run the different flavors of mutation source tests
in dedicated cases. This patch makes it so.
The resulting timings are:
                                     dev    debug
    sequential run:                  2m1s   53m50s
    --parallel-cases (+ this patch): 1m3s   31m15s
tests: unit(dev, debug)
"
* 'br-parallel-mutation-source-tests' of https://github.com/xemul/scylla:
test: Split multishard combining reader case
test: Split database test case
test: Split run_mutation_source_tests
(Single-partition) reversed queries are no longer unlimited but some
places still treat them as such. This causes, for example, shorter pages
for such queries, which breaks a test that expects certain results to
come in a single page.
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case by throwing an
exception.
As a preparation we also fix the code in `server_impl::send_snapshot` so
it works correctly when `rpc::send_snapshot` throws or returns a ready
future.
Refs #9407.
* kbr/snapshot-handle-errors:
test: raft: randomized_nemesis_test: remove an obsolete comment
test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot`
raft: server: handle `rpc::send_snapshot` returning instantly
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case.
Refs #9407.
If `rpc::send_snapshot` returned immediately with a ready future, or if
it threw, the code in `server_impl::send_snapshot` would not update
`_snapshot_transfers` correctly.
The code assumed that the continuation attached to `rpc::send_snapshot`
(with `then_wrapped`) was executed after `_snapshot_transfer` below the
`rpc::send_snapshot` call was updated. That would not necessarily be true
(the continuation may even not have been executed at all if
`rpc::send_snapshot` threw).
Fix that by wrapping the `rpc::send_snapshot` call into a continuation
attached to `later()`.
Originally authored by Gleb <gleb@scylladb.com>, I added a comment.
All the cases in this test also run mutation source tests, and the
case with a single-fragment buffer takes several times longer to
execute than the others.
Splitting this single case so that it runs the mutation source test
flavours in different cases improves the test parallelism.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test_database_with_data_in_sstables_is_a_mutation_source case runs
the mutation source tests in one go. The problem is that on each step
a whole new ks:cf is created, which takes the majority of the test's time.
At the end of the day this case is the slowest one in the suite, being
up to two times longer (depending on mode) than the #2 on this list.
This patch splits the case into 4 so that each mutation source flavor
is run in a separate case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 4 flavours of mutation source tests that are all run
sequentially -- plain, reversed, and the upgrade/downgrade ones that
check v1<->v2 conversions.
This patch splits them all into individual calls so that some
tests may want to have dedicated cases for each. "By default" they
are all run as they were.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
set_local_host_id() accepts a UUID reference and saves it in the
local keyspace and in all shards' local caches. Before it was
coroutinized, the UUID was copied into captures and survived; after
coroutinization only the reference remains. The problem is that
callers pass local variables as arguments that go away "really soon".
Fix by accepting the UUID by value; it's small enough to copy safely
and painlessly.
fixes: #9425
tests: dtest.ReplaceAddress_rbo_enabled.replace_node_diff_ip(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211004145421.32137-1-xemul@scylladb.com>
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.
* kbr/nemesis-liveness-check:
test: raft: randomized_nemesis_test: better liveness check at the end of generator test
test: raft: randomized_nemesis_test: take `time_point` instead of `duration` in `wait_for_leader`
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.
The commit also removes an incorrect assertion inside `wait_for_leader`.
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.
The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.
We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.
We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.
We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.
The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.
Details in commit messages. Read the commits in topological order
for best review experience.
Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)
Closes #9281
* github.com:scylladb/scylla:
table: add option to automatically bypass cache for reversed queries
test: reverse sstable reader with random schema and random mutations
sstables: mx: implement reversed single-partition reads
sstables: mx: introduce partition_reversing_data_source
sstables: index_reader: add support for iterating over clustering ranges in reverse
clustering_key_filter: clustering_key_filter_ranges owning constructor
flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
clustering_key_filter: document clustering_key_filter_ranges::get_ranges
Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.
If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
The test generates a random set of mutations and creates two readers:
- one by reversing the mutations, creating an sstable out of the result,
and querying it in reverse,
- one by creating an sstable directly from the mutations and querying it
in forward mode.
It checks that the readers give equal results.
The test already managed to find a bug where offsets returned by the
sstable index were interpreted incorrectly as absolute instead of
relative. It also helped find another bug unrelated to reversing (#9352).
Surprisingly few tests use the random schema and random mutation
utilities which seem to be quite powerful.
We use partition_reversing_data_source and the new `index_reader` methods
to implement single-partition reads in `mx_sstable_mutation_reader`.
The parsing logic does not need to change: the buffers returned by the
source already contain rows in reversed clustering order.
Some changes were required in `mp_row_consumer_m` which processes the
parsed rows and emits appropriate mutation fragments. The consumer uses
`mutation_fragment_filter` underneath to decide whether a fragment
should be ignored or not (e.g. the parsed fragment may come from outside
the requested clustering range), among other things. Previously
`mutation_fragment_filter` was provided a `partition_slice`. If the
slice was reversed, the filter would use
`clustering_key_filter_ranges::get_ranges` to obtain the clustering
ranges from the slice in unreversed order (they were reversed in the
slice) since we didn't perform any reversing in the reader. Now the
reader provides the ranges directly instead of the slice; furthermore,
the ranges are provided in native-reversed format (the order of ranges
is reversed and the ranges themselves are also reversed), and the schema
provided to the filter is also reversed. Thus to the filter everything
appears as if it was used during a non-reversed query but on a table
with reversed schema, which works correctly given the fact that the
reader is feeding parsed rows into the consumer in reversed order.
During reversed queries the reader uses alternative logic for skipping
to a later range (or, speaking in non-reversed terms, to an earlier range),
which happens in `advance_context`. It asks the index to advance its
upper bound in reverse so that the reversing_data_source notices the
change of the index end position and returns following buffers with rows
from the new range.
There is a slight difference in behavior of the reader from
`mp_row_consumer_m`'s point of view. For non-reversed reads, after
the consumer obtains the beginning of a row (`consume_row_start`)
- which contains the row's position but not the columns - and tells the
reader that the row won't be emitted because we need to skip to a later
range, the reader would tell the data source (the 'context') immediately
to skip to a later range by calling `skip_to`. This caused the source
not to return the rest of the row, and the rest of the row would not
be fed to the consumer (`consume_row_end`). However, for reversed reads,
the data source performs skipping 'on its own', after it notices that
the index end position has changed. This may happen 'too late', causing
the rest of the row to be returned anyway. We are prepared for this
situation inside `mp_row_consumer` by consulting the mutation fragment
filter again when the rest of the row arrives.
Fast forwarding is not supported at this point, which is fine given that
the cache is disabled for reversed queries for now (and the cache is the
only user of fast forwarding).
The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not. This means we need to modify
the slice sometimes: for non-single-partition queries the mx reader must
use a non-reversed slice, and for single-partition queries the mx reader
must use a native-reversed slice (where the clustering ranges themselves
are reversed as well). The modified slice must be stored somewhere; we
store it inside the mx reader itself so we don't need to allocate more
intermediate readers at the call sites. This causes the interface of
`mx::make_reader` to be a bit weird: for non-single-partition queries
where the provided slice is reversed the reader will actually return a
non-reversed stream of fragments, telling the user to reverse the stream
on their own. The interface has been documented in detail with
appropriate comments.
This patch adds an implementation of a data source that wraps an sstable
data file and returns data buffers with contents of one partition in the
sstable as if the rows of the partition were present in a reversed
order. In other words, to the user of the source the partition appears
to be reversed. We shall call this an 'intermediary' data source.
As part of the interface of the intermediary source the user is also
given read access to the source's current position over the data file,
and the constructor of the source takes a reference to `index_reader`.
This is necessary because the index operates directly on data file
offsets and we want the user to be able to use the index to skip
sequences of rows.
In order to ask the source to skip a sequence of rows - e.g. when jumping
between clustering ranges - the user must advance the index's upper bound
in reverse (to an earlier position). The source will then notice that
the end position of the index has changed and take appropriate action.
An alternative would be to translate the data positions of
`index_reader` to 'reversed positions' of the intermediary and then use
`skip_to` for skipping, as we do for forward reads. However this
solution would introduce more complexity to `index_reader` and the
intermediary source. One reason for the complexity in the input stream
is that we would have two kinds of skips: a single row skip,
and a skip to a clustering range. We know the offset of the next row,
so we could check that to differentiate them. We would also need to add
information about the position of the first clustering row and the end
of the last one to the index_reader. Skipping by checking the index seems
to be overall simpler.
For simplicity, the intermediary stream always starts with
parsing the partition header and (if present) the static row,
and returning the corresponding bytes as a result of the first
read.
After partition header and static row we must find the last row entry of
the requested range. If the range ends before the partition end (i.e.
there are more row entries after the range) we can use the 'previous
unfiltered size' of the row following the range; otherwise we must scan
the last promoted index block and take its last row.
After finding the data range of the last row, we parse rows
consecutively in reversed order. We must parse the rows partially
to learn their lengths and the positions of previous rows. We use
constructs similar to those in the sstable parser, but our parser
contains only a small part of the parsing coroutine and doesn't perform
any correctness checks. The row parser still turned out rather
big mostly because we can't always deduce the size of the clustering
blocks without reading the block header.
The parser allows reading rows while skipping their bodies also in
non-reversed order, which we are making use of while reading the
last promoted index block.
The intermediary data source has one more utility: reversing range
tombstones. When we read a tombstone bound/boundary, we modify
the data buffer so that the resulting bound/boundary has the reversed
kind (so we don't read ends before starts) and the boundaries have their
before/after timestamps swapped.
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.
To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.
We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.
To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.
When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.
For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition). For this
we introduce the `index_reader::advance_reverse` function.
"
The storage_service is involved in the cdc_generation_service internals
more than needed:
- the bool _for_testing bit is cdc-only
- there's a cdc_generation_service getter used only by the API
- part of the cdc_generation_service startup code sits in storage_service
This patch cleans up most of the above, leaving only the startup
_cdc_gen_id on board.
tests: unit(dev)
refs: #2795
"
* 'br-storage-service-vs-cdc-2' of https://github.com/xemul/scylla:
api: Use local sharded<cdc::generation_service> reference
main: Push cdc::generation_service via API
storage_service: Ditch for_testing boolean
cdc: Replace db::config with generation_service::config
cdc: Drop db::config from description_generator
cdc: Remove all arguments from maybe_rewrite_streams_descriptions
cdc: Move maybe_rewrite_streams_descriptions into after_join
cdc: Squash two methods into one
cdc: Turn make_new_cdc_generation a service method
cdc: Remove ring-delay arg from make_new_cdc_generation
cdc: Keep database reference on generation_service
The end_point_hints_manager's field _last_written_rp is initialized in
its regular constructor, but is not copied in the move constructor.
Because the move constructor is always invoked when creating a new
endpoint manager, the _last_written_rp field is effectively always
initialized with the zero-argument constructor, and is set to the zero
value.
This can cause the following erroneous situation to occur:
- Node A accumulates hints towards B.
- Sync point is created at A.
It will be used later to wait for currently accumulated hints.
- Node A is restarted.
The endpoint manager A->B is created with a bogus value in
_last_written_rp (it is set to zero).
- Node A replays its hints but does not write any new ones.
- A hint flush occurs.
If there are no hint segments on disk after flush, the endpoint
manager sets its last sent position to the last written position,
which is by design. However, the last written position has an incorrect
value, so the last sent position also becomes incorrect and too low.
- Try to wait for the sync point created earlier.
The sync point waiting mechanism waits until last sent hint position
reaches or goes past the position encoded in the sync point, but it
will not happen because the last sent position is incorrect.
The above bug can be (sometimes) reproduced in
hintedhandoff_sync_point_api_test dtest.
Now, the _last_written_rp field is properly initialized in the move
constructor, which prevents the bug described above.
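The bug pattern and its fix can be sketched in isolation (the names below are illustrative, not Scylla's actual classes): a hand-written move constructor that forgets one member leaves that member at its default (zero) value in the new object.

```cpp
#include <cassert>
#include <utility>

// A minimal sketch of the bug: the move constructor omits one member,
// so the moved-to object keeps the default-initialized (zero) value.
struct endpoint_manager_buggy {
    long last_written_rp = 0;
    explicit endpoint_manager_buggy(long rp) : last_written_rp(rp) {}
    // BUG: last_written_rp is not carried over; it stays 0 in the new object.
    endpoint_manager_buggy(endpoint_manager_buggy&&) {}
};

struct endpoint_manager_fixed {
    long last_written_rp = 0;
    explicit endpoint_manager_fixed(long rp) : last_written_rp(rp) {}
    // FIX: explicitly transfer the member in the move constructor.
    endpoint_manager_fixed(endpoint_manager_fixed&& other)
        : last_written_rp(other.last_written_rp) {}
};

long rp_after_move_buggy(long rp) {
    endpoint_manager_buggy a(rp);
    endpoint_manager_buggy b(std::move(a));
    return b.last_written_rp;
}

long rp_after_move_fixed(long rp) {
    endpoint_manager_fixed a(rp);
    endpoint_manager_fixed b(std::move(a));
    return b.last_written_rp;
}
```

A defaulted move constructor (or a default member initializer plus `= default`) would avoid this class of bug entirely.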
Fixes: #9320
Closes #9426
We must abort the environment before the ticker as the environment may
require time to keep advancing during abort in order for all operations
to finish, e.g. operations that can finish only due to timeout.
Currently such operations may cause the test to hang indefinitely
at the end.
The test requires a small modification to ensure that
`delivery_queue::push` is not called after the queue was aborted.
Message-Id: <20210930143539.157727-1-kbraun@scylladb.com>
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
The script assumes the remote name is "origin", a fair
assumption, but not universally true. Read it from configuration
instead of guessing it.
Closes #9423
Since May 2020 empty strings are allowed in DynamoDB as attribute values
(see announcement in [1]). However, they are still not allowed as keys.
We had tests that they are not allowed in keys of LSI or GSI, but missed
tests that they are not allowed as keys (partition or sort key) of base
tables. This patch adds these missing tests.
These tests pass - we already had code that checked for empty keys and
generated an appropriate error.
Note that for compatibility with DynamoDB, Alternator will forbid empty
strings as keys even though Scylla *does* support this possibility
(Scylla always supported empty strings as clustering key, and empty
partition keys will become possible with issue #9352).
[1] https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003122842.471001-1-nyh@scylladb.com>
The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception`
and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown
or api event vs. compaction aborting due to scrub validation error.
While at it, cleanup the existing retry logic related to `compaction_stop_exception`.
Test: unit(dev)
Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267)
Closes #9321
* github.com:scylladb/scylla:
compaction: split compaction_aborted_exception from compaction_stopped_exception
compaction_manager: maybe_stop_on_error: rely on retry=false default
compaction_manager: maybe_stop_on_error: sync return value with error message.
compaction: drop retry parameter from compaction_stop_exception
compaction_manager: move errors stats accounting to maybe_stop_on_error
Just like EBS disks for EC2, we want to use persistent disk on GCE.
We won't recommend using it, but we still need to support it.
Related scylladb/scylla-machine-image#215
Closes #9395
* seastar e6db0cd587...0ba6c36cc3 (6):
> semaphore: add try_get_units
> build: adjust compilation for libfmt 8+
> alloc_failure_injector: add explicit zero-initialization
> Change wakeup() from private to public in reactor.hh
> app-template: separate seastar options into --seastar-help
> files: Don't ignore FS info for read-only files
The pull_github_pr.sh script is preferred over colorful github
buttons, because it's designed to always assign proper authors.
It also works for both single- and multi-patch series, which
makes the merging process more universal.
Message-Id: <b982b650442456b988e1cea59aa5ad221207b825.1633101849.git.sarna@scylladb.com>
by setting _alloc_count initially to 0.
The _alloc_count hasn't been explicitly initialized. As the allocator has
usually been an automatic variable, _alloc_count initially held some
unspecified contents. This probably means that cases where the first few
allocations passed and a later one failed may never have been tested.
The good news is that most of the users have been migrated to the Seastar
failure injector, which (by accident) has been correct.
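A default member initializer makes the counter's starting state deterministic. A minimal sketch with hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the fix: giving the counter a default member initializer means
// an automatic (stack) instance starts in a known state instead of holding
// indeterminate memory. Names are illustrative, not Scylla's.
struct failure_injecting_allocator {
    uint64_t alloc_count = 0;  // previously left uninitialized
    uint64_t fail_after = 0;   // 0 means "never fail"

    // Simulates an allocation attempt; fails once the threshold is crossed.
    bool try_alloc() {
        ++alloc_count;
        return fail_after == 0 || alloc_count <= fail_after;
    }
};

bool first_allocs_pass_then_fail() {
    failure_injecting_allocator a;
    a.fail_after = 2;
    // Deterministic only because alloc_count reliably starts at 0.
    return a.try_alloc() && a.try_alloc() && !a.try_alloc();
}
```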
Closes #9420
The tracing code assumes that query_option_names and query_option_values
vectors always have the same length as the prepared_statements vector,
but it's not true. E.g. if one of the statements in a batch is
incorrect, it will create a discrepancy between the number of prepared
statements and the number of bound names and values, which currently
leads to a segmentation fault.
To overcome the problem, all three vectors are integrated into a single
vector, which makes size mismatches impossible.
Tested manually with code that triggers a failure while executing
a batch statement, because the Python driver performs driver-side
validation and thus it's hard to create a test case which triggers
the problem.
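The fix's shape - a vector of records instead of parallel vectors - can be sketched as follows (names are illustrative, not Scylla's actual types):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Rather than three parallel vectors whose lengths can drift apart, keep
// one vector of records so a size mismatch is impossible by construction.
struct traced_statement {
    std::string query;
    std::vector<std::string> bound_names;
    std::vector<std::string> bound_values;
};

// All per-statement tracing data travels together; iterating one vector
// can never index past the end of a shorter sibling vector.
using batch_trace_info = std::vector<traced_statement>;

std::size_t traced_statement_count(const batch_trace_info& info) {
    return info.size();
}
```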
Closes #9221
In order to ease future extensions to the information being sent
by the service level configuration change API, we pack the additional
parameters (other than the service level options) passed to the interface
into a structure. This will allow easy expansion in the future if more
parameters need to be sent to the observer.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
those stats aren't used in the compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refer to actual ongoing compaction tasks,
whereas _tasks refer to manager tasks, which are responsible for
spawning new compactions, retrying them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.
So stopping a task becomes simpler, and duplication is naturally
gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this run id is used to track partial runs that are being written to.
let's move it from info into task, as this is not an external info,
but rather one that belongs to compaction_manager.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There will be unbounded growth of pending tasks if they are submitted
faster than they are retired. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations, as the queue would be filled
with tons of tasks being reevaluated. By avoiding duplicates in the
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
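The deduplication idea can be sketched like this (illustrative code, not Scylla's actual task list):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>

// Bounding the pending-task queue: skip submitting a re-evaluation task
// for a table that already has one pending, so the queue holds at most
// one entry per table.
class pending_reevaluation_queue {
    std::deque<int> _queue;   // pending table ids, in arrival order
    std::set<int> _pending;   // tables that already have a queued task
public:
    // Returns false when the submission is dropped as a duplicate.
    bool submit(int table_id) {
        if (!_pending.insert(table_id).second) {
            return false;
        }
        _queue.push_back(table_id);
        return true;
    }
    std::size_t size() const { return _queue.size(); }
};

std::size_t queued_after_duplicates() {
    pending_reevaluation_queue q;
    q.submit(1);
    q.submit(1);  // dropped: table 1 already pending
    q.submit(2);
    q.submit(1);  // dropped again
    return q.size();
}
```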
Refs #9331.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
This avoids messing with the storage service in this API call. The next
patch will make use of the passed reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays it purely controls whether or not to inject delays into
timestamp generation by cdc. The same effect can be achieved by
configuring the cdc::generation_service directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to push the service towards the general idea that each
component should have its own config, with db::config staying
in main.
All of them are references taken from 'this'; since the function is
a generation_service method, it can use 'this' directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The generation service already has all it needs to do it. This
keeps storage_service smaller and less aware about cdc internals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The recently introduced make_new_generation() method just calls
another one by passing more this->... stuff as arguments. Relax
the flow by teaching the latter to use 'this' directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It has everything needed onboard. Only two arguments are required -- the
bootstrap tokens and whether or not to inject a delay.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There seems to be a problem with libwasmtime.a dependency on aarch64,
causing occasional segfaults during tests - specifically, tests
which exercise the path for halting wasm execution due to fuel
exhaustion. As a temporary measure, wasm is disabled on this
architecture to unblock the flow.
Refs #9387
Closes #9414
The --replace-token and --replace-node were added some time ago, but
have never been used since then, just parsed and immediately aborted.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210930102222.16294-1-xemul@scylladb.com>
"
Most of the code gets local_host_id by querying system keyspace.
There's one place (counters) that wants a future-less getter and
caches the host id on the database for that (it used to be cached on
storage_service some time ago).
This set relocates the value onto the local cache and frees the startup
code from the need to mess with the database for setting it. This also
cuts the hidden hints->qctx dependency.
tests: unit(dev)
"
* 'br-host-id-in-local-cache' of https://github.com/xemul/scylla:
storage_proxy: Use future-less local_host_id getting
database: Get local host id from system_keyspace
system_keyspace: Keep local_host_id on local_cache
code: Rename get_local_host_id() into load_...()
system_keyspace: Coroutinize get_/set_local_host_id
The methods in question are called from the API handlers which
are registered after start (and after host id load and cache),
so they can safely be switched.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some places in the code want to have future-less access to the
host id, now they do it all by themselves. Local cache seems to
be a better place (for the record -- some time ago the "better
place" argument justified cached host id relocation from the
storage_service onto the database).
While at it -- add the future-less getter for the host_id to be
used further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds a reproducer for issue #7586 - that Alternator queries
(Query) operating in reverse order (ScanIndexForward = false) are
artificially limited to 100 MB partitions because of their memory use.
This test generates a partition over 100 MB in size and then tries various
reverse queries on it - with or without Limit, starting at the end or
the middle of the partition. The test currently fails when a reverse query
refuses to operate on such a large partition - the log reports this:
ERROR ... Memory usage of reversed read exceeds hard limit of 104857600
(configured via max_memory_for_unlimited_query_hard_limit), while reading
partition K1H6ON3A1C
With yet-uncommitted reverse-scan improvements, the test proceeds further,
but still fails where we test that a reverse query with Limit not
explicitly specified should still be limited to a certain size (e.g. 1MB)
and cannot return the entire 100 MB partition in one response.
Please note that this is not a comprehensive test for Scylla's reverse
scan implementation: In particular we do not have separate tests for
reverse scan's implementation on different sources - memtables, sstables,
or the cache. Nor do we check all sorts of edge cases. We assume that
Scylla's reverse scan implementation will have its own unit tests
elsewhere that will check these things - and this test can focus on the
Alternator use case.
This test is marked "xfail" because it still fails on Alternator. It is
marked "veryslow" because it's a (relatively) slow test, taking multiple
seconds to set up the 100 MB partition. So run the test with the
pytest options "--runxfail --runveryslow" to see how it fails.
Refs #7586
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210930063700.407511-1-nyh@scylladb.com>
A future-less method will appear which better deserves the get_
prefix, so give the existing method the load_ prefix.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cql_config_updater is a sharded<> service that exists in main and
whose goal is to make sure some db::config's values are propagated into
cql_config. There's a more handy updateable_value<> glue for that.
tests: unit(dev)
refs: #2795
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210927090402.25980-1-xemul@scylladb.com>
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:
verb [[attr1, attr2...]] my_verb (args...) -> return_type;
`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.
For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.
These are the methods being created for each RPC verb:
static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
static future<> unregister_my_verb(netw::messaging_service* ms);
static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
Each method accepts a pointer to an instance of the `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation used to register verbs and pass messages.
There is also a method to unregister all verbs at once:
static future<> unregister(netw::messaging_service* ms);
The following attributes are supported when declaring an RPC verb
in the IDL:
* [[with_client_info]] - the handler will contain a const reference to
an `rpc::client_info` as the first argument.
* [[with_timeout]] - an additional `time_point` parameter is supplied
to the handler function, and the `send*` method uses the
`send_message_*_timeout` variant of the internal function to actually
send the message.
* [[one_way]] - the handler function is annotated with the
`future<rpc::no_wait_type>` return type to designate that a client
doesn't need to wait for an answer.
The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of a return clause is prohibited, and the
`send*` function always returns `future<>`.
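Putting the rules above together, a hypothetical IDL module might declare (verb, argument, and type names here are made up for illustration):

```
verb [[with_client_info, with_timeout]] fetch_data (uint32_t shard, sstring key) -> query_result;
verb [[one_way]] notify_done (uint32_t shard);
```

Per the rules above, the one-way verb must omit the return clause, and its generated `send*` function returns `future<>`.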
No existing code is affected.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Fixed some errors in .github/CODEOWNERS (which is used by Github to
recommend who should review which pull request), and also added a few
additional ownerships I thought of.
This file could still use more work - if you can think of specific
files or directories you'd like to review changes in, please send a
patch for this file to add yourself to the appropriate paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929141118.378930-1-nyh@scylladb.com>
Recently we observed an OOM caused by the partition based splitting
writer going crazy, creating 1.7K buckets while scrubbing an especially
broken sstable. To avoid situations like that in the future, this patch
provides a max limit for the number of live buckets. When the number of
buckets reaches this limit, the largest bucket is closed and replaced by
a new one. This will end up creating more output sstables during scrub
overall, but now they won't all be written at the same time, which
previously caused insane memory pressure and possibly OOM.
Scrub compaction sets this limit to 100, the same limit the TWCS's
timestamp based splitting writer uses (implemented through the
classifier -
time_window_compaction_strategy::max_data_segregation_window_count).
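The eviction policy - close the largest bucket when the cap is reached - can be sketched as follows (illustrative names and bookkeeping, not the real writer):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>

// When a new bucket would exceed the limit, close (here: erase) the
// bucket holding the most data and start a fresh one. In the real
// implementation, "closing" seals the bucket into an output sstable.
class bucket_writer {
    std::map<int, std::size_t> _buckets; // bucket key -> bytes written
    std::size_t _max_buckets;
public:
    explicit bucket_writer(std::size_t max_buckets) : _max_buckets(max_buckets) {}

    void write(int bucket_key, std::size_t bytes) {
        auto it = _buckets.find(bucket_key);
        if (it == _buckets.end() && _buckets.size() == _max_buckets) {
            // Evict the largest bucket to bound memory use.
            auto largest = std::max_element(
                _buckets.begin(), _buckets.end(),
                [] (const auto& a, const auto& b) { return a.second < b.second; });
            _buckets.erase(largest);
        }
        _buckets[bucket_key] += bytes;
    }

    std::size_t live_buckets() const { return _buckets.size(); }
};

std::size_t buckets_after_overflow() {
    bucket_writer w(2);
    w.write(1, 100);
    w.write(2, 10);
    w.write(3, 5); // forces eviction of bucket 1, the largest
    return w.live_buckets();
}
```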
Fixes: #9400
Tests: unit(dev)
Closes #9401
There are now 231 translation units that indirectly include commitlog.hh
due to the need to have access to db::commitlog::force_sync.
Move that type to a new file commitlog_types.hh and make it available
without access to the commitlog class.
This reduces the number of translation units that depend on commitlog.hh
to 84, improving compile time.
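The shape of the header split can be sketched like this (the real `force_sync` is defined by Scylla and may be a seastar bool_class rather than this enum):

```cpp
#include <cassert>

// Sketch of commitlog_types.hh: a tiny header holding only the flag type,
// so translation units that merely pass the flag around include this
// lightweight header instead of the full commitlog.hh.
namespace db {
enum class force_sync : bool { no = false, yes = true };
}

// A caller that only forwards the flag no longer depends on commitlog.hh.
bool wants_sync(db::force_sync fs) {
    return fs == db::force_sync::yes;
}
```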
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.
In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.
Fixes #9406
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
Currently no mutation-source supports reading in reverse natively but
we are working on changing that, adding native reverse read support to
memtable, cache and sstable readers. To ensure that all mutation
sources work in a correct and uniform manner when reading in reverse,
we add a reverse test to the mutation source test suite. This test
reverses the data that it passes to `populate()`, then reads in
forward order (in reverse compared to the data order). For this we use
the currently established reverse read API: reverse schema (schema
order == query order) and half-reversed (legacy) slice. All mutation
sources are prepared to work with reversed reads, using the
`make_reversing_reader()` adapter. As we progress with our native
reverse support, we will replace these adapters with native reversing
support. As part of this, we push down the reversing reader adapter
currently existing on the `query::consume_page()` level, to the
individual mutation sources.
Closes #9384
* github.com:scylladb/scylla:
test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set
querier: consume_page(): remove now unused max_size parameter
test/lib: mutation_source_test: test reading in reverse
test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
treewide: move reversing to the mutation sources
mutation_query: reconcilable_result_builder: document reverse query preconditions
sstable_set: time_series_sstable_set: reverse mode
multishard_mutation_query: set max result size on used permits
db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema
flat_mutation_reader: make_reversing_reader(): add convenience stored slice
mutation_reader: evictable_reader: add reverse read support
flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support
flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support
flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions
query-request: introduce `half_reverse_slice`
flat_mutation_reader_assertions: log what's expected
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak a SSTable if compaction strategy is changed, which causes
backlog to be computed incorrectly. Most of these problems happen because
sstable set and tracker are updated independently, so it could happen
that the tracker loses track (pun intended) of changes applied to the set.
The first patch will fix the leak when strategy is changed, and the third
patch will make sure that the tracker is updated atomically with the
sstable set, so these kinds of problems will not happen anymore.
Fixes #9157
"
* 'fixes_to_backlog_tracker_v4' of github.com:raphaelsc/scylla:
compaction: Update backlog tracker correctly when schema is updated
compaction: Don't leak backlog of input sstable when compaction strategy is changed
compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
compaction: simplify removal of monitors
To ensure all mutation sources uniformly support the current API of
reverse reading: reversed schema and half-reversed slice. This test will
also ensure that once we switch to native-reverse slice, all
mutation-sources will keep on working.
For reversed reads we must adjust the lower/upper bounds used by the
`position_reader_queue` and `clustering_combined_reader`. The bounds are
calculated using the mutation schema, but we need bounds calculated
using the query schema which is reversed.
The mutation source test suite will soon test reads in reverse. Prepare
for this by checking the reversed flag on the slice and not reversing
the data when set. The test will have two modes effectively:
* Forward mode: data is reversed before read, the reversed again during
read.
* Reverse mode: data is already reversed and it is reversed back during
read.
The two might not be the same in case the schema was upgraded or if we
are reading in reverse. It is important to use the passed-in query
schema consistently during a read.
Push down reversing to the mutation-sources proper, instead of doing it
on the querier level. This will allow us to test reverse reads on the
mutation source level.
The `max_size` parameter of `consume_page()` is now unused but is not
removed in this patch; it will be removed in a follow-up to reduce
churn.
DynamoDB limits the number of items that a BatchWriteItem call can write
to 25. As noted in issue #5057, in Alternator we don't have this limit
or any limit on the number of items in a BatchWriteItem - which probably
isn't wise.
This patch adds a simple xfailing test for this.
Refs #5057
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912140736.76995-1-nyh@scylladb.com>
The `expression` type (an std::variant) suffers from bad ergonomics:
- std::variant has poor/no constraints, so compiler error messages are long and uninformative
- it cannot be forward-declared (since std::variant does not support incomplete types)
- the type name is long, polluting compiler error messages and debug symbols
- it requires an artificial `nested_expression` when one expression is nested inside another
This series fixes those drawbacks by wrapping the variant in a class, adding constraints, and adding an extra indirection.
Test: unit (dev)
Closes #9402
* github.com:scylladb/scylla:
cql3: expr: drop nested_expression
cql3: expr: make expression forward declarable, easier to use
cql3: expr: construct column_value explicitly
cql3: expr: introduce as/as_if/is
cql3: expr: introduce expr::visit, replacing std::visit
On Ubuntu, scaling_governor becomes powersave after a reboot, even when
cpufrequtils is configured. This is because of ondemand.service, which
unconditionally changes scaling_governor to ondemand or powersave.
cpufrequtils starts before ondemand.service, so the scaling_governor it
sets gets overwritten by ondemand.service. To configure scaling_governor
correctly, we have to disable this service.
Fixes #9324
Closes #9325
Now that expression can be nested in its component types
directly, we can remove nested_expression. Most of the patch
adjusts uses to drop the dereference that was needed for
nested_expression.
Make expression a class, holding a unique_ptr to a variant,
instead of just a variant.
This has some advantages:
- the constructor can be properly constrained
- the type can be forward-declared
- the type name is just "expression", rather than
a huge variant. This makes compiler error messages easier
to read.
- the internal indirection allows removal of nested_expression
(later in the series)
We have a few cases where a column_definition* is converted
directly to an expression without an explicit call to column_value{}.
The new expression implementation will not allow this, so make
these cases explicit. IMO this is better form than relying
on the compiler to pick the right expression subtype.
Simple wrappers for std::get, std::get_if, std::holds_alternative.
The new names are shorter and IMO more readable.
Call sites are updated.
We will later replace the implementation.
The new expr::visit() is just a wrapper around std::visit(),
but has better constraints. A call to expr::visit() with a
visitor that misses an overload will produce an error message
that points at the missing type. This is done using the new
invocable_on_expression concept. Note it lists the expression
types one by one rather than using template magic, since
otherwise we won't get the nice messages.
Later, we will change the implementation when expression becomes
our own type rather than std::variant.
Call sites are updated.
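The wrappers can be sketched on a toy variant (the real `expression` is Scylla's own type, and the real `visit` adds the concept-based constraint described above):

```cpp
#include <cassert>
#include <string>
#include <variant>

// Illustrative as/as_if/is wrappers over std::get, std::get_if and
// std::holds_alternative - shorter names, same semantics.
struct column_value { std::string name; };
struct constant { int value; };
using expression = std::variant<column_value, constant>;

template <typename T> const T& as(const expression& e) { return std::get<T>(e); }
template <typename T> const T* as_if(const expression* e) { return std::get_if<T>(e); }
template <typename T> bool is(const expression& e) { return std::holds_alternative<T>(e); }

bool wrappers_roundtrip() {
    expression e = constant{7};
    return is<constant>(e)
        && !is<column_value>(e)
        && as<constant>(e).value == 7
        && as_if<constant>(&e) != nullptr
        && as_if<column_value>(&e) == nullptr;
}
```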
`time_series_sstable_set` uses `clustering_combined_reader` to implement
efficient single-partition reads. It provides a `position_reader_queue`
to the reader. This queue returns readers to the sstables from the set
in order of the sstables' lower bounds, and with each reader it provides
an upper bound for the positions-in-partition returned by the reader.
Until now we would assume non-reversed queries only. Reversed queries
were implemented by performing forward query in the lower layers
and reversing the results at the upper-most layer of the reader stack.
Before pushing the reversing down to the sources (in particular,
to sstable readers), we need to support the reverse mode in
`time_series_sstable_set` and the queue it provides to
`clustering_combined_reader`.
This requires using different lower and upper bounds in the queue.
For non-reversed reads we used `sstable::min_position()` as the lower
bound and `sstable::max_position()` as the upper bound. For reversed
reads all comparisons performed by `clustering_combined_reader` will be
reversed, as it will use a reversed schema. We can then use
`sstable::max_position().reversed()` for the lower bound and
`sstable::min_position().reversed()` for the upper bound.
08042c1688 added the query max result size
to the permit but only set it for single partition queries. This patch
does the same for range-scans in preparation for `query::consume_page()`
soon no longer propagating the max size.
The two might not be the same in case the schema was upgraded (unlikely
for virtual tables) or if we are reading in reverse. It is important to
use the passed-in query schema consistently during a read.
This serves as a convenience slice storage for reads that have to
store an edited slice somewhere. This is common for reads that work
with a native-reversed slice and so have to convert the one used in the
query -- which is in half-reversed format.
Evictable reader has to be made aware of reverse reads as it checks/edits the
slice. This shouldn't require reverse awareness normally, it is only
required because we still use the half-reversed (legacy) slice format
for reversed reads. Once we switch to the native format this commit can
be reverted.
* scylla-dev/raft-misc-v4-docedit:
raft: document pre-voting and protection against disruptive leaders
raft: style edits of README.md.
raft: document snapshot API
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if the table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes #9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The generic backlog formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By switching to unordered_map, removal of generated monitors is
made easier. This is a preparatory change for a patch which will
remove the monitors for all exhausted sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It transforms the position from a forward-clustering-order schema
domain to a reversed-clustering-order schema domain.
The object still refers to the same element of the space of keys under this
transformation. However, the identification of the position,
the position_in_partition object, is schema-dependent, it is always
interpreted relative to some schema. Hence the need to transform it
when switching schema domains.
Message-Id: <20210917102612.308149-1-tgrabiec@scylladb.com>
If a locally taken snapshot cannot be installed because a newer one was
received in the meantime, it should be dropped, otherwise it will take
space needlessly.
Message-Id: <YUrWXxVfBjEio1Ol@scylladb.com>
Consider:
- n1, n2 in the cluster
- n2 shutdown
- n2 sends gossip shutdown message to n1
- n1 delays processing of the handler of shutdown message
- n2 restarts
- n1 learns new gossip state of n2
- n1 resumes to handle the shutdown message
- n1 incorrectly marks n2 as being in shutdown status until n2 restarts again
To prevent this, we can send the gossip generation number along with the
shutdown message. If the generation number does not match the local
generation number for the remote node, the shutdown message will be
ignored.
Since we use rpc::optional to send the generation number, this works
in a mixed cluster.
Fixes #8597
Closes #9381
This PR adds the function:
```c++
constant evaluate(const expression&, const query_options&);
```
which evaluates the given expression to a constant value.
It binds all the bound values, calls functions, and reduces the whole expression to just raw bytes and `data_type`, just like `bind()` and `get()` did for `term`.
The code is often similar to the original `bind()` implementation in `lists.cc`, `sets.cc`, etc.
* For some reason in the original code, when a collection contains `unset_value`, then the whole collection is evaluated to `unset_value`. I'm not sure why this is the case, considering it's impossible to have `unset_value` inside a collection, because we forbid bind markers inside collections. For example here: cc8fc73761/cql3/lists.cc (L134)
This seems to have been introduced by Pekka Enberg in 50ec81ee67, but he has left the company.
I didn't change the behaviour, maybe there is a reason behind it, although maybe it would be better to just throw `invalid_request_exception`.
* There was a strange limitation on map key size, it seems incorrect: cc8fc73761/cql3/maps.cc (L150), but I left it in.
* When evaluating a `user_type` value, the old code tolerated `unset_value` in a field, but it was later converted to NULL. This means that `unset_value` doesn't work inside a `user_type`, I didn't change it, will do in another PR.
* We can't fully get rid of `bind()` yet, because it's used in `prepare_term` to return a `terminal`. It will be removed in the next PR, where we finally get rid of `term`.
Closes #9353
* github.com:scylladb/scylla:
cql3: types: Optimize abstract_type::contains_collection
cql3: expr: Convert evaluate_IN_list to use evaluate(expression)
cql3: expr: Use only evaluate(expression) to evaluate term
cql3: expr: Implement evaluate(expr::function_call)
cql3: expr: Implement evaluate(expr::usertype_constructor)
cql3: expr: Implement evaluate(expr::collection_constructor)
cql3: expr: Implement evaluate(expr::tuple_constructor)
cql3: expr: Implement evaluate(expr::bind_variable)
cql3: Add contains_collection/set_or_map to abstract_type
cql3: expr: Add evaluate(expression, query_options)
cql3: Implement term::to_expression for function_call
cql3: Implement term::to_expression for user_type
cql3: Implement term::to_expression for collections
cql3: Implement term::to_expression for tuples
cql3: Implement term::to_expression for marker classes
cql3: expr: Add data_type to *_constructor structs
cql3: Add term::to_expression method
cql3: Reorganize term and expression includes
Some future and enterprise features require us to inherit from
reader_concurrency_semaphore; this might require additional
"wrap up" operations to be done on stop, which serves as a barrier
for the semaphore. Here we simply make stop virtual so it can be
overridden and augmented.
This change has no significant impact on performance since stop
is called at most once in the lifetime of a semaphore.
The approach is to add two extension points to the
reader_concurrency_semaphore class, one just before the stop code is
executed and one just after.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #9373
There's a circular dependency:
query processor needs database
database owns large_data_handler and compaction_manager
those two need qctx
qctx owns a query_processor
The latter, hidden dependency is not "tracked" by
constructor arguments -- the query processor is started after
the database and is deferred to be stopped before it. This works
in scylla, because the query processor doesn't really stop there,
but in cql_test_env it's problematic as it stops everything,
including the qctx.
Recent database start-stop sanitation revealed this problem --
on database stop either the large_data_handler or the compaction
manager tries to start (or continue) messing with the query
processor. One problem was faced immediately and plugged with the
75e1d7ea safety check inside the large_data_handler, but cql_test_env
tests still suffer from use-after-free on a stopped query processor.
The fix is to partially revert 4b7846da by making the tests
stop some pieces of the database (including the large_data_handler
and compaction manager) as they used to before. In scylla this is
probably not needed, at least for now -- the database shutdown code
was and still is run right before the stopping one.
tests: unit(debug)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210924080248.11764-1-xemul@scylladb.com>
The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide to which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".
More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.
Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but nevermind.
In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.
In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.
This pretty much invalidates the original argument, as seen in
the above example: A will still consider D alive, thus it won't become
a candidate.
Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one, it can be solved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.
Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.
In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply for shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
contains_collection() and contains_set_or_map() used to be recalculated on each call.
Now the result is calculated only once, during type creation.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
evaluate_IN_list used term::bind(), but now it's possible
to make it use term::to_expression() and then evaluate(expression)
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Finally we don't need term::bind() to evaluate a term.
We can just convert the term to expression and call evaluate(expression).
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
function_call can be evaluated now.
The code matches the one from functions::function_call::bind.
I needed to add a cache id to function_call in order for it to work properly.
See the blurb in struct function_call for more information.
New code corresponds to bind() in cql3/functions/functions.cc.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
usertype_constructor can now be evaluated.
To evaluate a usertype_constructor we need to know the type,
because the fields have to be in the correct order.
Type has been added to usertype_constructor.
New code corresponds to old bind() of user_types::delayed_value in cql3/user_types.cc.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
collection_constructor can now be evaluated.
There is a bit of a problem, because we don't know the type of an empty collection_constructor,
but luckily empty collection constructors get converted to constants during preparation.
For some reason in the original code when a collection contains unset_value,
the whole collection is automatically evaluated to unset_value. I didn't change this behaviour.
New code corresponds to old bind() of lists::delayed_value in cql3/lists.cc, sets::delayed_value etc.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Tuple constructors can now be evaluated.
New code corresponds to old bind() of tuples::delayed_value::marker in cql3/tuples.cc
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Implement evaluating a bind_variable.
To be able to evaluate a bind_variable we need to know the type of the bound value.
This is why a data_type has been added to the bind_variable struct.
There are some quirks when evaluating a bind_variable.
The first problem occurs when the variable has been sent with an older cql serialization format and contains collections.
In that case the value has to be reserialized to use the newest cql serialization format.
The second problem occurs when there is a set or a map in the value.
The set value sent by the driver might not have the elements in the correct order, contain duplicates etc.
When a set or map is detected in the value it is reserialized as well.
collection_type_impl::reserialize doesn't work for this purpose, because it uses data_value, which performs neither sorting nor duplicate removal.
New code corresponds to old bind() of lists::marker in cql3/lists.cc, sets::marker etc.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Sometimes we need to know whether some type contains
some collection, set, or map inside.
Introduce two functions that provide this information.
Information about collection is useful for reserializing
values with old serialization format.
Information about set/map is useful for reserializing
sets and maps to remove duplicates.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that takes an expression and evaluates it to a constant.
Evaluating specific expression variants will be implemented in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Implement to_expression for non-terminals that represent a bind marker.
For now each bind marker has a shape describing where it is used, but hopefully this can be removed in the future.
In order to evaluate a bind_variable we need to know its type.
The type is needed to pass to constant and to validate the value.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
It is useful to have a data_type in *_constructor structs when evaluating.
The resulting constant has a data_type, so we have to find it somehow.
For tuple_constructor we don't have to create a separate tuple_type_impl instance.
For collection_constructor we know what the type is even in case of an empty collection.
For usertype_constructor we know the name, type and order of fields in the user type.
Additionally without a data_type we wouldn't know whether the type is reversed or not.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a method that converts a given term to the matching expression.
It will be used as an intermediate step when implementing evaluate(expression).
evaluate(term) will convert the term to the expression and then call evaluate(expression).
For terminals this is simply calling get() to serialize the value.
For non-terminals the implementation is more complicated and will be implemented in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Make term.hh include expression.hh instead of the other way around.
expression can't be forward declared.
expression is needed in term.hh to declare term::to_expression().
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).
The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.
This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
and appends to it,
- clients send append operations to the server; the server returns the
state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
returning the sequence to the client, and only one instance (the
server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
We use AppendReg instead of ExReg in `basic_generator_test`
with a generator which generates a sequence of append operations with
unique integers.
This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line serializability checker with O(1) amortized
complexity on each operation completion.
We also enforce linearizability by checking that every
completed operation was previously invoked.
We also perform a simple liveness check at the end of the test by
ensuring that a leader becomes eventually elected and that we can
successfully execute a call.
* kbr/linearizability-v2:
test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test
test: raft: randomized_nemesis_test: introduce append register
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Those stats aren't used in the compaction stats API and therefore
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refer to actual ongoing compaction tasks,
whereas _tasks refer to manager tasks which are responsible for
spawning new compactions, retrying them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.
So stopping a task becomes simpler, and duplication is naturally
gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This run id is used to track partial runs that are being written to.
Let's move it from info into task, as this is not external info,
but rather state that belongs to compaction_manager.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The previous commit already relaxed the condition for test_fib,
but the same should be done for test_fib_called_on_null
for an identical reason - more than one error can be expected
when calling a heavily recursive function, and either
fuel exhaustion, hitting the stack limit, or any other
InvalidRequest exception should be accepted.
Closes #9363
Add new struct to the `expression` variant:
```c++
// A value serialized with the internal (latest) cql_serialization_format
struct constant {
cql3::raw_value value;
data_type type; // Never nullptr, for NULL and UNSET might be empty_type
};
```
and use it where possible instead of `terminal`.
This struct will eventually replace all classes deriving from
`terminal`, but for now `terminal` can't be removed completely.
We can't get rid of terminal yet, because sometimes `terminal` is
converted back to `term`, which `constant` can't do. This won't be a
problem once we replace term with expression.
`bool` is removed from `expression`, now `constant` is used instead.
This is a redesign of PR #9203, there is some discussion about the
chosen representation there.
Closes #9371
* github.com:scylladb/scylla:
cql3: term: Remove get_elements and multi_item_terminal from terminals
cql3: Replace most uses of terminal with expr::constant
cql3: expr: Remove repetition from expr::get_elements
cql3: expr: Add expr::get_elements(constant)
cql3: term: remove term::bind_and_get
cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
cql3: expr: Add evaluate_IN_list
cql3: tuples: Implement tuples::in_value::get
cql3: Move data_type to terminal, make get_value_type non-virtual
cql3: user_types: Implement get_value_type in user_types.hh
cql3: tuples: Implement get_value_type in tuples.hh
cql3: maps: Implement get_value_type in maps.hh
cql3: sets: Implement get_value_type in sets.hh
cql3: lists: Implement get_value_type in lists.hh
cql3: constants: Implement get_value_type in constants.hh
cql3: expr: Add expr::evaluate
cql3: Make collection term get() use the internal serialization format
cql3: values: Add unset value to raw_value_view::make_temporary
cql3: expr: Add constant to expression
"
The main challenge here is to move the messaging_service.start_listen()
call from gossiper into main. Other changes are pretty minor
compared to that and include
- patch gossiper API towards a standard start-shutdown-stop form
- gossiping "sharder info" in initial state
- configure cluster name and seeds via gossip_config
tests: unit(dev)
dtest.bootstrap_test.start_stop_test_node(dev)
manual(dev): start+stop, nodetool enable-/disablegossip
refs: #2737
refs: #2795
refs: #5489
"
* 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla:
code: Expell gossiper.hh from other headers
storage_service: Gossip "sharder" in initial states
gossiper: Relax set_seeds()
gossiper, main: Turn init_gossiper into get_seeds_from_config
storage_service: Eliminate the do-bind argument from everywhere
gossiper: Drop ms-registered manipulations
messaging, main, gossiper: Move listening start into main
gossiper: Do handlers reg/unreg from start/stop
gossiper: Split (un)init_messaging_handler()
gossiper: Relocate stop_gossiping() into .stop()
gossiper: Introduce .shutdown() and use where appropriate
gossiper: Set cluster_name via gossip_config
gossiper, main: Straighten start/stop
tests/cql_test_env: Open-code tst_init_ms_fd_gossiper
tests/cql_test_env: De-global most of gossiper
gossiper: Merge start_gossiping() overloads into one
gossiper: Use is_... helpers
gossiper: Fix do_shadow_round comment
gossiper: Dispose dead code
Use AppendReg instead of ExReg for the state machine.
Use a generator which generates a sequence of append operations with
unique integers.
This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line linearizability checker with O(1) amortized
complexity on each operation completion.
We also perform a simple liveness check at the end of the test by
ensuring that a leader becomes eventually elected and that we can
successfully execute a call.
* seastar c04a12edbd...e6db0cd587 (13):
> Merge "Add kernel stack trace reporting for stalls" from Avi
Ref #8828
> Merge "Keep XFS' dioattr cached" from Pavel E
> coroutines: de-template maybe_yield()
> sharded: Add const versions of map_reduce's
> apps/io_tester: remove unused lambda capture
> doc: exclude seastar::coroutine::internal namespace
> deprecate unaligned_cast<> from unaligned.hh
> reactor: adjust max_networking_aio_io_control_blocks to lower size when fs.aio-max-nr is small
> build: clarify choice of C++ dialect, and change default to C++20
> coding_style: update concepts style to snake_case
> Merge "Teach io_tester to submit requests-per-second flow" from Pavel E
> cmake: find and link against Boost::filesystem
> coroutine: add maybe_yield
We know (verified by existing tests) that null keys are not allowed -
neither as partition keys nor clustering keys.
In issue #9352 a question was raised of whether an *empty string* is
allowed as a key on a base table (not a materialized view or index).
The following tests confirm that the current situation is as follows:
1. An empty string is perfectly legal as a clustering key.
2. An empty string is NOT ALLOWED as a partition key - the error
"Key may not be empty" is reported if this is attempted.
3. If the partition key is compound (multiple partition-key columns)
then any or all of them may be empty strings.
These tests pass the same on both Cassandra and Scylla, showing that
this bizarre (and undocumented) behavior is identical in both.
Refs #9352.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922131310.293846-1-nyh@scylladb.com>
"
There's a whole lot of places that create an sstable for tests
like this
auto sst = env.make_sstable(...);
sst->write_components(...);
sst->load();
Some of them are already generalized with the make_sstable_easy
helper, but several ungeneralized instances remain.
Found while hunting down the places that use default IO sched
class behind the scenes.
tests: unit(dev)
"
* 'br-sst-tests-make-sstable-easy' of https://github.com/xemul/scylla:
test: Generalize make_sstable() and make_sstable_easy()
test: Use now existing helpers elsewhere
test: Generalize all make_sstable_easy()-s
test: Set test change estimation to 1
test: Generalize make_sstable_easy in mutation tests
test: Generalize make_sstable_easy in set tests
test: Reuse make_sstable_easy in datafile tests
test: Relax make_sstable_easy in compaction tests
When a string column is indexed with a secondary index, the empty value
for this column (an empty string '') is perfectly legal, and should be
indexed as well. This is not the same as an unset (null) value which
isn't indexed.
The following test demonstrates that this case works in Cassandra, but
does not in Scylla (so the test is marked "xfail"). In Scylla, a query
that returns the expected results with ALLOW FILTERING suddenly returns
a different (and wrong) result when an index is added on the table.
This test reproduces issue #9364.
Refs #9364.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922121510.291826-1-nyh@scylladb.com>
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak a SSTable if compaction strategy is changed, which causes
backlog to be computed incorrectly. Most of these problems happen because
sstable set and tracker are updated independently, so it could happen
that the tracker loses track (pun intended) of changes applied to the set.
The first patch will fix the leak when strategy is changed, and the third
patch will make sure that tracker is updated atomically with sstable set,
so these kinds of problems will not happen anymore.
Fixes #9157
test: mode(debug)
"
* 'fixes_to_backlog_tracker_v3' of https://github.com/raphaelsc/scylla:
compaction: Update backlog tracker correctly when schema is updated
compaction: Don't leak backlog of input sstable when compaction strategy is changed
compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
compaction: simplify removal of monitors
Scylla and Cassandra do not allow an empty string as a partition key,
but a materialized view might "convert" a regular string column into a
partition key, and an empty string is a perfectly valid value for this
column. This can result in a view row which has an empty string as a
partition key. This case works in Cassandra, but doesn't in Scylla (the
row with the empty string as a partition key doesn't appear). The
following test demonstrates this difference between Scylla and Cassandra
(it passes on Cassandra, fails on Scylla, and accordingly marked
"xfail").
Refs #9375.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922115000.290387-1-nyh@scylladb.com>
Recently we had to say goodbye to our dear friend Pekka.
He orphaned a few subsystems that can't call for his help in code
reviews anymore.
This patch makes sure no one will bother Pekka in his afterlife.
It also cleans up HACKING.md a little bit by removing Pekka and Duarte
from the maintainer/reviewer lists.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <98ba1aed9ee8a87b9037b5032b82abc5bfddbd66.1632301309.git.piotr@scylladb.com>
A variant of make_reversed() which goes through the schema registry,
teaching the schema to the registry if necessary. This effectively
caches the result of the reversing and as an added bonus double
reversing yields the very same schema C++ object that was the starting
point.
Closes #9365
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).
The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.
This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
and appends to it,
- clients send append operations to the server; the server returns the
state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
returning the sequence to the client, and only one instance (the
server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
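A minimal sketch of this copy-on-conflict scheme (illustrative C++, not the actual test implementation): each instance pairs a shared vector with a prefix length, appends in place when it owns the tip, and copies only when another instance appended first.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Immutable sequence sharing its vector with other instances. Each instance
// only looks at the first size_ elements, so appends performed through one
// instance are invisible to the others.
class append_seq {
    std::shared_ptr<std::vector<int>> data_;
    size_t size_;
    append_seq(std::shared_ptr<std::vector<int>> d, size_t s)
        : data_(std::move(d)), size_(s) {}
public:
    append_seq() : data_(std::make_shared<std::vector<int>>()), size_(0) {}
    // Returns a new sequence with x appended; *this stays unchanged.
    append_seq append(int x) const {
        if (data_->size() == size_) {
            // Fast path: we are the tip, extend the shared vector in place.
            data_->push_back(x);
            return append_seq(data_, size_ + 1);
        }
        // Slow path: someone else appended after our prefix; copy it first.
        auto copy = std::make_shared<std::vector<int>>(
            data_->begin(), data_->begin() + size_);
        copy->push_back(x);
        return append_seq(std::move(copy), size_ + 1);
    }
    std::vector<int> to_vector() const {
        return {data_->begin(), data_->begin() + size_};
    }
};
```

With a single appender (the Raft server) only the fast path runs, giving the amortized O(1) appends described above.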
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the number of shards and ignore-msb-bits are gossiped
with a separate call. It's simpler to include this data into
the initial gossiping state.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's much shorter and simpler to pass the seeds, obtained from the
config, into gossiper via gossip_config rather than with the help
of a special call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looking into the init_gossiper() helper makes it clear that all it does
is get seeds, provider and listen_address from the config and generate
a set of seeds for the gossiper, then call gossiper.set_seeds().
This patch renames the helper to get_seeds_from_config(), removes
all but the db::config& argument from it and moves the call to set_seeds()
into main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same as in previous patch -- the gossiper doesn't need to know
if it should call messaging.start_listen() or not, and neither
should the storage_service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before preparing the cluster join process the messaging should be
put into listening state. Right now it's done "on-demand" by the
call to the do_shadow_round(), also there's a safety call in the
start_gossiping(). Tests, however, should not start listening, so
the do_bind boolean exists and is passed all the way around.
Make the main() code explicitly call the messaging.start_listen()
and leave tests without it. This change makes messaging start
listening a bit earlier, but in between these old and new places
there's nothing that needs messaging to stay deaf.
As the do_bind becomes useless, the wait_for_gossip_to_settle() is
also moved into main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On start, handlers can be registered any time before the messaging
starts to listen. On stop, handlers can remain registered arbitrarily
long, since the messaging service stops early in drain_on_shutdown().
One tricky place is API start_/stop_gossiping(). The latter calls
gossiper::stop() thus unregistering the handlers. So to make the
start_gossiping() work it must call gossiper::start() in advance.
Overall the gossiper start/stop becomes this:
gossiper.start()
`- registers handlers
gossiper.start_gossiping()
`- // starts gossiping
gossiper.shutdown()
`- // stops gossiping
gossiper.stop()
`- calls shutdown() // re-entrant
`- unregisters handlers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question is called in two places:
1. In main() as a fuse against early exception before creating the
drain_on_shutdown() defer
2. In the stop_gossiping() API call
Both can be replaced with the stop_gossiping() call from the .stop()
method, here's why:
1. In main the gossiper::stop() call is already deferred right after
the gossiper is started. So this change moves it above. It may
happen that an exception pops up before the old fuse was deferred,
but that's OK -- the stop_gossiping() is safe against early and
repeated entry
2. The stop_gossiping() change is effectively a rename -- it calls the
stop_gossiping() as it did before, but with the help of the .stop()
method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The start/stop sequence we're moving towards assumes a shutdown (or
drain) method that will be called early on stop to notify the service
that the system is going down so it could prepare.
For gossiper it already means calling stop_gossiping() on the shard-0
instance. So by and large this patch renames a few stop_gossiping()
calls into .shutdown() ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's taken purely from the db::config and thus can be set up early.
Right now the empty name is converted into the "Test Cluster" one, but
remains empty in the config and is later used by the system_keyspace
code. This logic remains intact.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Turn the gossiper start/stop sequence into the canonical form
gossiper.start(std::ref(dependencies)...).get();
auto stop_gossiper = defer([&gossiper] {
gossiper.invoke_on_all(&gossiper::stop).get();
});
gossiper.invoke_on_all(&gossiper::start).get();
The deferred call should be gossiper.stop(), but for now we keep
the instances' memory alive.
This trick is safe at this point, because .start() and .stop()
methods are both empty (still).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
constant is now ready to replace terminal as a final value representation.
Replace bind() with evaluate and shared_ptr<terminal> with constant.
We can't get rid of terminal yet. Sometimes terminal is converted back
to term, which constant can't do. This won't be a problem once we
replace term with expression.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was some repeated code in the expr::get_elements family
of functions. It has been reduced to one function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We need to be able to access elements of a constant.
Adds functions to easily do it.
Those functions check all preconditions required to access elements
and then use partially_deserialize_* or similar.
It's much more convenient than using partially_deserialize directly.
get_list_of_tuples_elements is useful with IN restrictions like
(a, b) IN [(1, 2), (3, 4)].
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
term::bind_and_get is not needed anymore, remove it.
Some classes use bind_and_get internally, those functions are left intact
and renamed to bind_and_get_internal.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Start using evaluate_to_raw_value instead of bind_and_get.
This is a step towards using only evaluate.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A list representing IN values might contain NULLs before evaluation.
We can remove them during evaluation, because nothing equals NULL.
If we don't remove them, errors will occur, because a list can't contain NULLs.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
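The idea can be illustrated with a toy model (hypothetical names; `std::nullopt` stands in for a CQL NULL): since `x IN (..., NULL, ...)` can never match via the NULL entry, the NULLs can simply be dropped when the IN list is evaluated.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Evaluate an IN list, dropping NULL entries: nothing equals NULL,
// so removing them cannot change the result of the restriction.
std::vector<int> evaluate_in_list(const std::vector<std::optional<int>>& raw) {
    std::vector<int> result;
    for (const auto& v : raw) {
        if (v) {
            result.push_back(*v);
        }
    }
    return result;
}
```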
To convert a terminal to expr::constant we need to be able to serialize it.
tuples::in_value didn't have serialization implemented, do it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Every class now has an implementation of get_value_type().
We can simply make the base class keep the data_type.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in user_types.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in tuples.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in maps.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in sets.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in lists.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in constants.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Adds the functions:
constant evaluate(term*, const query_options&);
raw_value_view evaluate_to_raw_value(term*, const query_options&);
These functions take a term, bind it and convert the terminal
to constant or raw_value_view.
In the future these functions will take expression instead of term.
For that to happen bind() has to be implemented on expression,
this will be done later.
Also introduces terminal::get_value_type().
In order to construct a constant from terminal we need to know the type.
It will be implemented in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A term should always be serialized using the internal cql serialization format.
A term represents a value received from the driver,
but for every use we are going to need it in the internal serialization format.
Other places in the code already do this, for example see list_prepare_term,
it calls value.bind(query_options::DEFAULT) to evaluate a collection_constructor.
query_options::DEFAULT has the latest cql serialization format.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When unset_value is passed to make_temporary it gets converted to null_value.
This looks like a mistake, it should be just unset_value.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Adds constant to the expression variant:
struct constant {
raw_value value;
data_type type;
};
This struct will be used to represent constant values with known bytes and type.
This corresponds to the terminal from current design.
bool is removed from expression, now constant is used instead.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The helper is called once. Inlining it in the caller keeps the code
compact, makes it look more like main() and facilitates further patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Sizes too small to fit a ::seastar::memory::free_object won't contain
any objects at all so they don't contribute anything to the listing
beyond noise.
Closes #9366
Gossiper is still global and cql_test_env heavily exploits this fact.
Clean that up by getting the gossiper once and using the local reference
everywhere else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them and one is only called from the API with the
do_bind always set to "yes". This fact makes it possible to remove
it by adding relevant defaults for the other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several state booleans on the service and some helpers to
manipulate/check those. Make the code consistent by always using these
helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The debug_show() is unused, as well as the advertise_myself().
The _features_condvar used to be listened on before f32f08c9,
now it's signal-only.
Feature friendship with gossiper is not required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... from Nadav Har'El
This small series adds a stub implementation of Alternator's
UpdateTimeToLive and DescribeTimeToLive operations. These operations can
enable, disable, or inquire about the chosen expiration-time attribute.
Currently, the information about the chosen attribute is only saved,
with no actual expiration of any items taking place.
Because this is an incomplete implementation of this feature, it is not
enabled unless an experimental flag is enabled on all nodes in the
cluster.
See the individual patches for more information on what this series
does.
Refs #5060.
Closes #9345
* github.com:scylladb/scylla:
test/alternator: rename utility function test_table_name()
alternator: stub TTL operations
alternator: make three utility functions in executor.cc non-static
test/alternator: test another corner case of TTL
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as that of the sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes #9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
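A simplified sketch of the coupled update described above (illustrative names, not the real table/tracker classes): every mutation of the sstable set adjusts the backlog in the same call, so the set and the tracker can no longer go out of sync the way steps 1-5 describe.

```cpp
#include <cassert>
#include <map>

// Toy table: the sstable set (generation -> size) and the backlog counter
// are private and only mutated together, through one method per change.
class table {
    std::map<int, long> sstables_;
    long backlog_ = 0;
public:
    void add_sstable(int gen, long size) {
        sstables_[gen] = size;
        backlog_ += size; // tracker updated atomically with the set
    }
    void remove_sstable(int gen) {
        auto it = sstables_.find(gen);
        if (it != sstables_.end()) {
            backlog_ -= it->second; // only tracked sizes are subtracted,
            sstables_.erase(it);    // so backlog can never go negative
        }
    }
    long backlog() const { return backlog_; }
};
```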
The generic backlog formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
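A toy model of the arithmetic above (illustrative, not the real backlog tracker): when a compaction's output is added to ALL but its input is never removed, the same data is counted twice.

```cpp
#include <cassert>

// Backlog formula from the commit message: ALL + PARTIAL - COMPACTING
// (all values are byte counts in this toy model).
struct backlog_tracker {
    long all = 0;        // total size of tracked sstables
    long partial = 0;    // size of partially written compaction output
    long compacting = 0; // size already processed from compaction input
    long backlog() const { return all + partial - compacting; }
};

// Models the stop_tracking_ongoing_compactions() situation: the output
// sstable is added to ALL while the input is never removed, so the
// tracker reports double the real backlog.
long backlog_after_leaky_compaction(long input_size) {
    backlog_tracker t;
    t.all = input_size;  // input sstable tracked
    t.all += input_size; // output added, input never removed: the leak
    return t.backlog();
}
```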
By switching to unordered_map, removal of generated monitors is
made easier. This is a preparatory change for a patch which will
remove monitors for all exhausted sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The former constructs a memtable from the vector of mutations and
then does exactly the same steps as the latter one -- creates an
sstable corresponding to the memtable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are already four of them. Those working with the mutation reader
can be folded into one with some default args.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test intention is not to test how zero estimated partitions
work, there's another case for that (in another test). Also it
looks like 0 doesn't flow anywhere far, it's std::max-ed into
1 early inside the mc::writer constructor.
This change significantly simplifies the unification of the set
of make_sstable_easy()-s in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same trick as in the previous patch, but the new helper
accepts a memtable instead of a mutation reader and makes the
reader from the memtable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are a bunch of places in the test that do the same sequence
of steps to create an sstable. Generalize them into a helper
that resembles the one from the previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch is two-fold. First it changes the signature of the
local helper to facilitate next patching. Second, it makes more
relevant places in the test use this helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The version argument can be omitted; env.make_sstable will
default it to the highest version. The generation argument is left
and defaulted to 1.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Originally, the expected failure for a recursive invocation
test case was to expect that fuel gets exhausted, but it's also
possible to hit a stack limit first. All errors are equally
expected here as long as the execution is halted, so let's relax
the condition and accept any wasm-related InvalidRequest errors.
Closes #9361
This reverts commit e9343fd382, reversing
changes made to 27138b215b. It causes a
regression in v2 serialization_format support:
collection_serialization_with_protocol_v2_test fails with: marshaling error: read_simple_bytes - not enough bytes (requested 1627390306, got 3)
Fixes#9360
"
Currently database start and stop code is quite dispersed and
exists in two slightly different forms -- one in main and the
other one in cql_test_env. This set unifies both and makes
them look almost the ideal way:
sharded<database> db;
db.start(<dependencies>);
auto stop = defer([&db] { db.stop().get(); });
db.invoke_on_all(&database::start).get();
with all (well, most) other mentions of the "db" variable
being arguments for other services' dependencies.
tests: unit(dev, release), unit.cross_shard_barrier(debug)
dtest.simple_boot_shutdown(dev)
refs: #2737
refs: #2795
refs: #5489
"
* 'br-database-teardown-unification-2' of https://github.com/xemul/scylla: (26 commits)
main: Log when database starts
view_update_generator: Register staging sstables in constructor
database, messaging: Delete old connection drop notification
database, proxy: Relocate connection-drop activity
messaging, proxy: Notify connection drops with boost signal
database, tests: Rework recommended format setting
database, sstables_manager: Sow some noexcepts
database: Eliminate unused helpers
database: Merge the stop_database() into database::stop()
database: Flatten stop_database()
database: Equip with cross-shard-barrier
database: Move starting bits into start()
database: Add .start() method
main: Initialize directories before database
main, api: Detach set_server_config from database and move up
main: Shorten commitlog creation
database: Extract commitlog initialization from init_system_keyspace
repair: Shutdown without database help
main: Shift iosched verification upward
database: Remove unused mm arg from init_non_system_keyspaces()
...
We have a utility function test_table_name() to create a unique name for
a test table. The funny thing is that because this function's name starts
with the string "test_", pytest believes it's a test. This doesn't cause any
problems (it's considered a *passing* test), but it's nevertheless strange
to see it listed on the list of tests.
So in this patch, we trivially rename this function to unique_table_name(),
a name which pytest doesn't mistake for the name of a test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive
operations to Alternator. These operations can enable, disable, or
inquire about the chosen expiration-time attribute.
Currently, the information about the chosen attribute is only saved, with
no actual expiration of any items taking place.
Some of the tests for the TTL feature start to pass, so their xfail tag
is removed.
Because this new feature is incomplete, it is not enabled unless
the "alternator-ttl" experimental feature is enabled. Moreover, for
these operations to be allowed, the entire cluster needs to support
this experimental feature, because all nodes need to participate in the
data expiration - if some old nodes don't support Alternator TTL, some
of the data they hold won't get expired... So we don't allow enabling
TTL until all the nodes in the cluster support this feature.
The implementation is in a new source file, alternator/ttl.cc. This
source file will continue to grow as we implement the expiration feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Make three of the utility functions in alternator/executor.cc, which
until now were static (local to the source files) external symbols
(in the alternator namespace). This will allow using them in other
Alternator source files - like the one in the next patch for TTL
support, which we'll want to put in a separate source file.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Usually the TTL feature's expiration-time attribute is a schema-less
attribute, implemented in Alternator as a JSON-serialized item in a
bigger map column. However, key attributes are a special case because
they are implemented as separate columns. We already had test cases
showing that this case works too - for the case of hash and range keys.
In this test we test another possibility of an attribute that is
implemented as a schema column - the case of an LSI key.
Like the other TTL tests, this test too passes on DynamoDB but xfails on
Alternator because the TTL feature is not yet implemented.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This was a global variable that was potentially modified from a
performance benchmark. It would modify the behavior of `index_reader`
in certain scenarios.
Remove the variable so we can specify the behavior of `index_reader`
functions without relying on anything other than what's passed into the
constructor and the function parameters.
Indicate whether the compaction job should be aborted
due to an error using a new compaction_aborted_exception type,
vs. compaction_stopped_exception that indicates
the task should be stopped due to some external event that
doesn't indicate an error (like shutdown or api call).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is misleading to set retry to true in the following statement
and return it later on when the `will_stop` parameter is true.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, _stats.errors does not account for non-retryable errors
like storage_io_error.
Fixes#9354
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We found that a user can mistakenly break the system with the --builddir
option, something like './reloc/build_deb.sh --builddir /'.
To avoid that we need to stop removing the entire $BUILDDIR, and remove
only the directories we have to clean up before building the deb package.
See: https://github.com/scylladb/scylla-python3/pull/23#discussion_r707088453
Closes #9351
First, it's to fix the discarded future during the registration. The
future is not really a problem, as it's always the no-op ready one:
at that stage the view_update_generator is neither aborted nor
in throttling state.
Second, this change is to keep database start-up code in main
shorter and cleaner. Registering staging sstables belongs to the
view_update_generator start code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Database no longer needs it. Since the only user of the old-style
notification is gone -- remove it as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On start, the database subscribes to the messaging-service connection
drop notification to drop the hit-rate from column families. However, the
updater and reader of those hit-rates is the storage_proxy, so it
must be the _proxy_ who drops the hit-rate.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The messaging_service keeps track of a list of connection-drop
listeners. This list is not auto-removing and is thus not safe
on stop (fortunately there's only 1 non-stopping client of it
so far).
This patch adds a safer notification based on boost/signals.
Also storage_proxy is subscribed to it in advance to demonstrate
how it all fits together and make the next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tests don't have an sstable format selector and enforce the needed
format by hand with the help of a special database:: method. It's
more natural to provide it via config. Doing this makes database
initialization in main and cql_test_env closer to each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Setting sstables format into database and into sstables_manager is
all plain assignments. Mark them as noexcept; the next patch will be
visibly exception safe after that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some large-data-handler-related helpers left after previous
patches, they can be removed altogether.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After stop_database() became shard-local, it's possible to merge
it with database::stop() as they are both called one after another
on scylla stop. In cql-test-env there are a few more steps in
between, but they don't rely on the database being partially
stopped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method needs to perform four steps cross-shard synchronously:
first stop compaction manager, then close user and, after it,
system tables, finally shutdown the large data handler.
This patch reworks this synchronization with the help of cross-shard
barrier added to the database previously. The motivation is to merge
.stop_database() with .stop().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make sure a node-wide barrier exists on a database when scylla starts.
Also provide a barrier for cql_test_env. In all other cases keep a
solo-mode barrier so that single-shard db stop doesn't get blocked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch extends the IDL grammar by allowing the use of
multiple `[[...]]` attribute clauses, as well as specifying
more than one attribute inside a single attribute clause, e.g.:
`[[attr1, attr2]]` will be parsed correctly now.
For now, in all existing use cases only the first attribute
is taken into account and the rest is ignored.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
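A toy parser illustrating the extended clause syntax (a sketch, not the actual IDL compiler code): it accepts multiple `[[...]]` clauses and comma-separated attributes inside each.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collect all attribute names from every [[...]] clause in the input,
// splitting each clause body on commas and trimming spaces.
std::vector<std::string> parse_attributes(const std::string& src) {
    std::vector<std::string> attrs;
    size_t pos = 0;
    while ((pos = src.find("[[", pos)) != std::string::npos) {
        size_t end = src.find("]]", pos + 2);
        if (end == std::string::npos) {
            break; // unterminated clause: stop (a real parser would error)
        }
        std::string body = src.substr(pos + 2, end - pos - 2);
        size_t start = 0;
        while (start <= body.size()) {
            size_t comma = body.find(',', start);
            std::string item = body.substr(
                start, comma == std::string::npos ? std::string::npos
                                                  : comma - start);
            size_t a = item.find_first_not_of(' ');
            size_t b = item.find_last_not_of(' ');
            if (a != std::string::npos) {
                attrs.push_back(item.substr(a, b - a + 1));
            }
            if (comma == std::string::npos) break;
            start = comma + 1;
        }
        pos = end + 2;
    }
    return attrs;
}
```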
Introduce a new header `message/rpc_protocol_impl.hh`, move here
the following things from `message/messaging_service.cc`:
* RPC protocol wrappers implementation
* Serialization thunks
* `register_handler` and `send_message*` functions
This code will be used later for IDL-generated RPC verbs
implementation.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is to keep all database start (and stop) code together. Right
now directories startup breaks this into two pieces.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The api::set_server_config() depends on sharded database to start, but
really doesn't need it -- it needs only the db::config object which is
available earlier.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This does three things in one go:
- converts
db.invoke_on_all([] (database& db) {
return db.init_commitlog();
});
into a one-line version
db.invoke_on_all(&database::init_commitlog);
- removes the shard-0 pre-initialization for tests, because
tests don't have the problem this pre-initialization solves
- makes init_commitlog() re-entrant to let regular start
not check for shard-0 explicitly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The intention is to keep all database initialization code in one place.
The init_system_keyspace() is one of the obstacles -- it initializes db's
commitlog as first step.
This patch moves the commitlog initialization out of the mentioned
helper. The result looks clumsy, but it's temporary, next patches will
brush it up.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sharded database reference is passed into repair_shutdown() just
to have something to call .invoke_on_all() onto. There's the more
appropriate sharded repair_service for this, so use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a block of CLI options sanity checks at the beginning of
main's starting lambda; it's better to have the iosched validation
in this block.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the debug:: inhabitants have names that look like "the_<classname>".
This patch brings the database piece up to this standard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the large data handler methods rely on the global qctx object to
write down their notes. This creates a circular dependency:
query processor -> database -> large_data_handler -> qctx -> qp
In Scylla this is not a technical problem: neither qctx nor the
query processor is stopped. It is a problem in cql_test_env,
which stops everything, including resetting qctx to null. To avoid
tests stepping on a nullptr qctx, add an explicit check.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wasm::engine exists as a sharded<> service in main, but it's only
passed by local reference into the database on start. There's not much
profit in keeping it at main scope; things get much simpler if the
engine is kept purely on the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a synchronization facility to let shards wait for each
other to pass through certain points in the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series of commits fixes a small number of bugs in the current implementation of the HTTP API that allows waiting until hints are replayed, found by running the `hintedhandoff_sync_point_api_test` dtest in debug mode.
Refs: #9320
Closes #9346
* github.com:scylladb/scylla:
commitlog: make it possible to provide base segment ID
hints: fill up missing shards with zeros in decoded sync points
hints: propagate abort signal correctly in wait_for_sync_point
hints: fix use-after-free when dismissing replay waiters
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
Closes#9347
The copy constructor of small vector has a noexcept specifier, yet
it calls `reserve(size_t)`, which can throw `std::bad_alloc`. This
causes issues in tests that use alloc_failure_injector, but could
potentially also surface in production.
Closes#9338
Adds a configuration option to the commitlog: base_segment_id. When
provided, the commitlog uses this ID as a base of its segment IDs
instead of calculating it based on the number of milliseconds between
the epoch and boot time.
This is needed for the feature that allows waiting for hints to be
replayed: it relies on replay positions monotonically increasing.
An endpoint manager periodically re-creates its commitlog instance;
if it is re-created when there are no segments on disk, it will
currently choose the number of milliseconds between the epoch and
boot time, which might result in segments being generated with the
same IDs as some segments previously created and deleted during the
same runtime.
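A minimal sketch of the selection logic described above, with illustrative names (`choose_segment_id_base` is not Scylla's actual function):

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Sketch: with no base provided, fall back to milliseconds since the
// epoch (the old behavior); a caller that re-creates the commitlog can
// pass the previous highest segment ID to keep replay positions
// monotonically increasing across re-creations.
uint64_t choose_segment_id_base(std::optional<uint64_t> base_segment_id) {
    if (base_segment_id) {
        return *base_segment_id; // caller-provided base wins
    }
    using namespace std::chrono;
    return duration_cast<milliseconds>(
            system_clock::now().time_since_epoch()).count();
}
```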
Between encoding and decoding of a sync point, the node might have been
restarted and resharded with increased shard count. During resharding,
existing hints segments might have been moved to new shards. Because of
that, we need to make sure that we wait for foreign segments to be
replayed on the new shards too.
This commit modifies the sync point decoding logic so that it places a
zero replay position for new shards. Additionally, an (incorrect) shard
count check is removed from `storage_proxy::wait_for_hint_sync_point`,
because the shard count in a decoded sync point is now guaranteed to be
no less than the node's current shard count.
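The padding step might be sketched like this, with `replay_position` simplified to an integer and all names hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for Scylla's replay position; zero means "nothing to wait for
// yet, but do wait for anything this shard replays".
using replay_position = uint64_t;

// A sync point encoded before resharding may carry fewer replay
// positions than the node now has shards; pad with zero positions so
// hints moved to the new shards are waited on too.
std::vector<replay_position> pad_sync_point(std::vector<replay_position> decoded,
                                            size_t current_shard_count) {
    if (decoded.size() < current_shard_count) {
        decoded.resize(current_shard_count, replay_position{0});
    }
    return decoded;
}
```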
This will make it easier, for example, to enforce memory limits in lower
levels of the `flat_mutation_reader` stack.
By default, the query result size is unlimited. However, for specific queries it is
possible to store a different value (e.g. obtained from a `read_command` object)
through a setter. An example of this can be seen in the last commit of this PR,
where we set the limit to `cmd.max_result_size` if engaged, or to the 'unlimited
query' limit (using `database::get_unlimited_query_max_result_size()`) if not.
Refs: #9281. The v2 version of the reverse sstable reader PR will be based on this PR:
we'll use the query max result size parameter in one of the readers down the stack
where `read_command` is not available but `reader_permit` is.
Closes#9341
* github.com:scylladb/scylla:
table, database: query, mutation_query: remove unnecessary class_config param
reader_permit: make query max result size accessible from the permit
reader_concurrency_semaphore: remove default parameter values from constructors
query_class_config: remove query::max_result_size default constructor
When `manager::wait_for_sync_point` is called, the abort source from the
arguments (`as`) might have already been triggered. In such case, the
subscription which was supposed to trigger the `local_as` abort source
won't be run, and the code will wait indefinitely for hints to be
replayed instead of checking the replay status and returning
immediately.
This commit fixes the problem by manually triggering `local_as` if `as`
has already been triggered.
When the promise waited on in the `wait_until_hints_are_replayed_up_to`
function is resolved, a continuation runs which prints a log line with
information about this event. The continuation captures a pointer to the
hints sender and uses it to get information about the endpoint whose
hints are waited for. However, at this point the sender might have been
deleted - for example, when the node is being stopped and everybody
waiting for hints is dismissed.
This commit fixes the use-after-free by getting all necessary
information while the sender is guaranteed to be alive and captures it
in the continuation's capture list.
The semaphore inside was never accessed and `max_memory_for_unlimited_query`
was always equal to `*cmd.max_result_size` so the parameter was completely
redundant.
`cmd.max_result_size` is supposed to always be set in the affected
functions - which are executed on the replica side - as soon as the
replica receives the `read_command` object, in case the parameter was
not set by the coordinator. However, we don't have a guarantee at the
type level (it's still an `optional`). Many places used
`*cmd.max_result_size` without even an assertion.
To make the code a bit safer, we check `cmd.max_result_size` and, if
it's indeed engaged, store it in `reader_permit`. We then access it from
`reader_permit` where necessary. If `cmd.max_result_size` is not set, we
assume this is an unlimited query and obtain the limit from
`get_unlimited_query_max_result_size`.
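The fallback can be sketched as follows; the function name is illustrative and the real code works on `query::max_result_size` objects rather than plain integers:

```cpp
#include <cstdint>
#include <optional>

// Sketch: if the coordinator did not set a result-size limit on the
// read command, treat the query as unlimited and use the configured
// unlimited-query limit instead of dereferencing an empty optional.
uint64_t effective_max_result_size(std::optional<uint64_t> cmd_max_result_size,
                                   uint64_t unlimited_query_limit) {
    return cmd_max_result_size.value_or(unlimited_query_limit);
}
```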
This will make it easier, for example, to enforce memory limits in lower
levels of the flat_mutation_reader stack.
By default the size is unlimited. However, for specific queries it is
possible to store a different value (for example, obtained from a
`read_command` object) through a setter.
It's easy to forget about supplying the correct value for a parameter
when it has a default value specified. It's safer if 'production code'
is forced to always supply these parameters manually.
The default values were mostly useful in tests, where some parameters
didn't matter that much and where the majority of uses of the class are.
Without default values adding a new parameter is a pain, forcing one to
modify every usage in the tests - and there are a bunch of them. To
solve this, we introduce a new constructor which requires passing the
`for_tests` tag, marking that the constructor is only supposed to be
used in tests (and the constructor has an appropriate comment). This
constructor uses default values, but the other constructors - used in
'production code' - do not.
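The `for_tests` tag pattern might look roughly like this (a toy class, not the real `reader_concurrency_semaphore`; all names and default values are illustrative):

```cpp
// Tag type whose only job is to mark a constructor as test-only.
struct for_tests_tag {};

class semaphore {
    int _count;
    long _memory;
public:
    // Production constructor: no defaults, every limit is explicit.
    semaphore(int count, long memory) : _count(count), _memory(memory) {}

    // Test-only constructor: defaults are allowed, but the call site
    // must opt in by passing the tag, so 'production code' can't use
    // them by accident.
    explicit semaphore(for_tests_tag, int count = 100, long memory = 1 << 20)
        : _count(count), _memory(memory) {}

    int count() const { return _count; }
    long memory() const { return _memory; }
};
```

Adding a new parameter with a default now touches only the tagged constructor, leaving the many test call sites unchanged.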
The default values for the fields of this class didn't make much sense,
and the default constructor was used only in a single place so removing
it is trivial.
It's safer when the user is forced to supply the limits.
This series adds very basic support for WebAssembly-based user-defined functions.
This series comes with a basic set of tests which were used to designate a minimal goal for this initial implementation.
Example usage:
```cql
CREATE FUNCTION ks.fibonacci (str text)
RETURNS NULL ON NULL INPUT
RETURNS boolean
LANGUAGE xwasm
AS ' (module
(func $fibonacci (param $n i32) (result i32)
(if
(i32.lt_s (local.get $n) (i32.const 2))
(return (local.get $n))
)
(i32.add
(call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
(call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
)
)
(export "fibonacci" (func $fibonacci))
) '
```
Note that the language is currently called "xwasm" as in "experimental wasm", because its interface is still subject to change in the future.
Closes#9108
* github.com:scylladb/scylla:
docs: add a WebAssembly entry
cql-pytest: add wasm-based tests for user-defined functions
main: add wasm engine instantiation
treewide: add initial WebAssembly support to UDF
wasm: add initial WebAssembly runtime implementation
db: add wasm_engine pointer to database
lang: add wasm_engine service
import wasmtime.hh
lua: move to lang/ directory
cql3: generalize user-defined functions for more languages
A first set of wasm-based test cases is added.
The tests include verifying that supported types work
and that validation of the input wasm is performed.
This commit adds very basic support for user-defined functions
coded in wasm. The support is very limited (only a few types work)
and was not tested against reactor stalls or for performance in general.
The engine is based on wasmtime and is able to:
- compile wasm text format to bytecode
- run a given compiled function with custom arguments
This implementation is missing crucial features, like running
on any types other than 32-bit integers. It serves as a skeleton
for a future full implementation.
Add new struct to the `expression` variant:
```c++
// A value serialized with the internal (latest) cql_serialization_format
struct constant {
cql3::raw_value value;
data_type type; // Never nullptr, for NULL and UNSET might be empty_type
};
```
and use it where possible instead of `terminal`.
This struct will eventually replace all classes deriving from `terminal`, but for now `terminal` can't be removed completely.
We can't get rid of terminal yet, because sometimes `terminal` is converted back to `term`, which `constant` can't do. This won't be a problem once we replace term with expression.
`bool` is removed from `expression`, now `constant` is used instead.
This is a redesign of PR #9203, there is some discussion about the chosen representation there.
Closes#9244
* github.com:scylladb/scylla:
cql3: term: Remove get_elements and multi_item_terminal from terminals
cql3: Replace most uses of terminal with expr::constant
cql3: expr: Remove repetition from expr::get_elements
cql3: expr: Add expr::get_elements(constant)
cql3: term: remove term::bind_and_get
cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
cql3: expr: Add evaluate_IN_list
cql3: tuples: Implement tuples::in_value::get
cql3: Move data_type to terminal, make get_value_type non-virtual
cql3: user_types: Implement get_value_type in user_types.hh
cql3: tuples: Implement get_value_type in tuples.hh
cql3: maps: Implement get_value_type in maps.hh
cql3: sets: Implement get_value_type in sets.hh
cql3: lists: Implement get_value_type in lists.hh
cql3: constants: Implement get_value_type in constants.hh
cql3: expr: Add expr::evaluate
cql3: values: Add unset value to raw_value_view::make_temporary
cql3: expr: Add constant to expression
All the namespace scope functions in system_keyspace have no place
to store context, so they must store their context in global
variables. This prevents conversion of those global variables
to constructor-provided dependencies.
Take the first step towards providing a place to store the
context by converting system_keyspace to a class. All the functions
are static, so no context is yet available, but we can de-static-ify
them incrementally in the future and store the context in class members.
Closes#9335
* github.com:scylladb/scylla:
system_keyspace: convert from namespace to class
system_keyspace: prepare forward-declared members
system_keyspace: rearrange legacy subnamespace
system_keyspace: remove outdated java code
The above-mentioned method is supposed to work both ways: reversed <-> forward, so setting the reversed bit is not correct: it should be toggled, which is what this mini-series does.
Closes#9327
* github.com:scylladb/scylla:
reverse_slice(): toggle reversed bit instead of setting it
partition_slice_builder(): add with_option_toggled()
enum_set: add toggle()
'$builddir/release/{scylla_product}-python3-package.tar.gz' on the
dist-python3 target is for compat-python3; we forgot to remove it in 35a14ab.
Fixes #9333
Closes #9334
constant is now ready to replace terminal as the final value representation.
Replace bind() with evaluate() and shared_ptr<terminal> with constant.
We can't get rid of terminal yet. Sometimes terminal is converted back
to term, which constant can't do. This won't be a problem once we
replace term with expression.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was some repeated code in the expr::get_elements family
of functions. It has been consolidated into one function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We need to be able to access elements of a constant.
Add functions to do it easily.
Those functions check all preconditions required to access elements
and then use partially_deserialize_* or similar.
It's much more convenient than using partially_deserialize directly.
get_list_of_tuples_elements is useful with IN restrictions like
(a, b) IN [(1, 2), (3, 4)].
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
term::bind_and_get is not needed anymore, remove it.
Some classes use bind_and_get internally, those functions are left intact
and renamed to bind_and_get_internal.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
"
There's a single call to get_local_storage_proxy in the cdc
code, used only to get the database from it. Fortunately, the
database can easily be provided there via a call argument.
tests: unit(dev)
"
* 'br-remove-proxy-from-cdc' of https://github.com/xemul/scylla:
cdc: Add database argument to is_log_for_some_table
client_state: Pass database into has_access()
client_state: Add database argument to has_schema_access
client_state: Add database argument to has_keyspace_access()
cdc: Add database argument to check_for_attempt_to_create_nested_cdc_log
"
There's a bunch of explicit and implicit async context nesting
in the sstables tests. This set turns them into a single non-nested
async (mostly with an awk script).
The indentation in first two patches is deliberately left as it
was before patching, i.e. -- slightly broken. As a consolation,
after the third patch it suddenly becomes fixed as the unneeded
intermediate call with broken indent is removed.
tests: unit(dev)
"
* 'br-sst-tests-no-nested-async' of https://github.com/xemul/scylla:
test: Don't nest seastar::async calls (2nd cont)
test: Don't nest seastar::async calls (cont)
test: Don't nest seastar::async calls
To build an Ubuntu AMI with CPU scaling configuration, we need a forced
running mode for scylla_cpuscaling_setup, which runs setup without
checking for scaling_governor support.
See scylladb/scylla-machine-image#204
Closes#9326
Start using evaluate_to_raw_value instead of bind_and_get.
This is a step towards using only evaluate.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A list representing IN values might contain NULLs before evaluation.
We can remove them during evaluation, because nothing equals NULL.
If we don't remove them, errors will occur, because a list value can't contain NULLs.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
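A sketch of the evaluation-time filtering, with CQL values simplified to `std::optional` integers (all names are illustrative, not the actual cql3 code):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for a possibly-NULL CQL value.
using maybe_value = std::optional<int64_t>;

// Drop NULLs from an IN list while evaluating it: since nothing
// compares equal to NULL, NULL entries cannot affect the result, and a
// list value cannot hold them anyway.
std::vector<int64_t> evaluate_in_list(const std::vector<maybe_value>& raw) {
    std::vector<int64_t> out;
    for (const auto& v : raw) {
        if (v) {
            out.push_back(*v); // keep only non-NULL elements
        }
    }
    return out;
}
```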
To convert a terminal to expr::constant we need to be able to serialize it.
tuples::in_value didn't have serialization implemented, so implement it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Every class now has an implementation of get_value_type(),
so we can simply make the base class keep the data_type.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in user_types.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in tuples.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in maps.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in sets.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in lists.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To convert a terminal to expr::constant we need to know the value type.
Implement getting value type for terminals in constants.hh.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Adds the functions:
```c++
constant evaluate(term*, const query_options&);
raw_value_view evaluate(term*, const query_options&);
```
These functions take a term, bind it, and convert the terminal
to a constant or raw_value_view.
In the future these functions will take expression instead of term.
For that to happen bind() has to be implemented on expression,
this will be done later.
Also introduces terminal::get_value_type().
In order to construct a constant from terminal we need to know the type.
It will be implemented in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When unset_value is passed to make_temporary it gets converted to null_value.
This looks like a mistake; it should remain unset_value.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Adds constant to the expression variant:
struct constant {
raw_value value;
data_type type;
};
This struct will be used to represent constant values with known bytes and type.
This corresponds to the terminal from current design.
bool is removed from expression, now constant is used instead.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All the namespace scope functions in system_keyspace have no place
to store context, so they must store their context in global
variables. This prevents conversion of those global variables
to constructor-provided dependencies.
Take the first step towards providing a place to store the
context by converting system_keyspace to a class. All the functions
are static, so no context is yet available, but we can de-static-ify
them incrementally in the future and store the context in class members.
Indentation is a mess, but can be easily fixed later.
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.
The result isn't lovely, but can be adjusted later.
Merge two fragments together, in anticipation of making 'legacy'
a struct instead of a namespace (when system_keyspace is a class,
we can't nest a namespace inside it).
On ARM, the code in libstdc++ responsible for computing random
floating-point numbers is very slow, because it uses `long double`
arithmetic, which is emulated in software on this architecture.
The performance effect on read queries is noticeable – about 6% of the
total work of a read from cache. Since probabilistic read repair is almost
always disabled (and under consideration for removal) let's just optimize
the case when it's disabled.
Fixes #9107
Closes #9329
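The shape of the optimization can be sketched as follows (an illustrative function, not Scylla's actual read-repair code):

```cpp
#include <random>

// Skip the expensive uniform_real_distribution call entirely when read
// repair is disabled (chance == 0), which is the common case; the slow
// long-double path in libstdc++ is then never reached.
bool should_read_repair(double chance, std::mt19937& rng) {
    if (chance <= 0.0) {
        return false; // fast path: no RNG work at all
    }
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(rng) < chance; // dist yields values in [0, 1)
}
```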
WASM engine needs to be used from two separate contexts:
- when a user-defined function is created via CQL
- when a user-defined function is received during schema migration
The object these two contexts have in common is the database,
so that's where the reference is stored.
"
The test contains a single case that runs 6 different
cases inside and is one of the longest tests out there.
Splitting it improves parallel-cases suite run time.
tests: unit(dev, debug, release)
"
* 'br-split-sst-conforms-to-ms' of https://github.com/xemul/scylla:
tests: Fix indentation after previous patch
tests: Split sstable_conforms_to_mutation_source
timeuuid is not monotonic when now() is called on different connections,
so when running tests that depend on that property, we get failures if
using the Scylla driver (which became standard in 729d0fe).
Skip the tests for now, until we figure out what to do. We probably
can't make now() globally monotonic, and there isn't much to gain
by making it monotonic only per connection, since clients are allowed
to switch connections (and even nodes) at will.
Ref #9300
Closes #9323
[avi: committing my own patch to unblock master]
The test/alternator/run script runs Scylla and then runs pytest against
it. But when passing the "--aws" option, the intention is that these
tests be run against AWS DynamoDB, not a local Scylla, so there is no
point in starting Scylla at all - so this is what we do in this patch.
This doesn't really add a new feature - "test/alternator/run --aws"
will now be nothing more than "cd test/alternator; pytest --aws".
But it adds the convenience that you can run the same tests on Scylla
and AWS with exactly the same "run" command, just adding the "--aws"
option, and don't need to sometimes use "run" and sometimes "pytest".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912133239.75463-1-nyh@scylladb.com>
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.
before:
real 8m51.347s
user 7m5.743s
sys 5m11.185s
after:
real 7m4.249s
user 5m14.085s
sys 2m11.197s
Note that real time is higher than user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to disk flushing, so more work is needed.
Closes#9319
The name compaction_options is confusing, as it overlaps in meaning
with compaction_descriptor; it's hard to reason about the exact
difference between them without digging into the implementation.
compaction_options is intended to only carry options specific to
a given compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for
readers.
[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)
For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```
Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210730105428.2844668-1-bhalevy@scylladb.com>
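One way to keep such log lines decodable is to hex-encode keys that are not printable ASCII; a hypothetical helper sketching the idea (not Scylla's actual logging code):

```cpp
#include <cstdio>
#include <string>

// Return the key verbatim if it is printable ASCII; otherwise return a
// hex dump of its bytes, so the log line stays valid UTF-8 for tools
// (like Python's file decoder) that parse the logs.
std::string printable_or_hex(const std::string& key) {
    bool printable = true;
    for (unsigned char c : key) {
        if (c < 0x20 || c > 0x7e) {
            printable = false;
            break;
        }
    }
    if (printable) {
        return key;
    }
    std::string out;
    char buf[3];
    for (unsigned char c : key) {
        std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}
```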
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.
That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:
1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
documentation in the future.
Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.
Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.
Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.
Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.
Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use an `ip_addr inet` field instead: we always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.
So, now the snapshots schema looks like this:
CREATE TABLE raft_snapshots (
group_id timeuuid,
server_id uuid,
snapshot_id uuid,
idx int,
term int,
-- no `config` field here, moved to `raft_config` table
PRIMARY KEY ((group_id, server_id))
)
CREATE TABLE raft_config (
group_id timeuuid,
my_server_id uuid,
server_id uuid,
disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
can_vote bool,
ip_addr inet,
PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
);
This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
Tests: unit(dev)
* manmanson/raft_snapshots_new_schema_v2:
test: adjust `schema_change_test` to include new `system.raft_config` table
raft: new schema for storing raft snapshots
raft: pass server id to `raft_sys_table_storage` instance
The interval template member functions mostly accept
tri-comparators but a few functions accept less-comparators.
To reduce the chance of error, and to provide better error
messages, constrain comparator parameters to the expected
signature.
In one case (db/size_estimates_virtual_reader.cc) the caller
had to be adjusted. The comparator supported comparisons
of the interval value type against other types, but not
against itself. To simplify things, we add that signature too,
even though it will never be called.
Closes#9291
column_value_tuple overlaps both column_value and tuple_constructor
(in different respects) and can be replaced by a combination: a
tuple_constructor of column_value. The replacement is more expressive
(we can have a tuple of column_value and other expression types), though
the code (especially the grammar) does not allow it yet.
So remove column_value_tuple and replace it everywhere with
tuple_constructor. Visitors get the merged behavior of the existing
tuple_constructor and column_value_tuple, which is usually trivial
since tuple_constructor and column_value_tuple came from different
hierarchies (term::raw and relation), so usually one of the types
just calls on_internal_error().
The change results in awkward casts in two areas: WHERE clause
filtering (equal() and related), and clustering key range evaluation
(limits() and related). When equal() is replaced by recursive
evaluate(), the casts will go away (to be replaced by the evaluate()
visitor). Clustering key range extraction will remain limited
to tuples of column_value, so the prepare phase will have to vet
the expressions to ensure the casts don't fail (and use the
filtering path if they will).
Tests: unit (dev)
Closes#9274
Currently scrub compaction filters out sstables that are undergoing
(regular) compaction. This is surprising to the user, and we would like
scrub (in validate mode or otherwise) to examine all sstables in the
table.
Scrub in VALIDATE mode is read-only, therefore it can run in parallel to
regular compaction. However, this series makes sure it selects all
sstables in the table, without filtering sstables undergoing compaction.
For scrub in non-validation mode, we would like to ensure that it
examined all sstables that were sealed when it started and it fixed any
corruption (based on the scrub mode). Therefore, we stop ongoing
compactions when running scrub in non-validation modes. Otherwise
compaction might just copy the corrupt data onto new sstables, requiring
scrub to run again.
Also, acquire _compaction_locks write lock for the table to serialize
with other custom compaction jobs like major compaction, reshape, and
reshard.
Fixes#9256
Test: unit(dev) DTest:
nodetool_additional_test.py:TestNodetool.{validate_sstable_with_invalid_fragment_test,
validate_ks_sstable_with_invalid_fragment_test,validate_with_one_node_expect_data_loss_test}
Closes#9258
* github.com:scylladb/scylla:
compaction_manager: rewrite_sstables: acquire _compaction_locks
compaction_manager: perform_sstable_scrub: run_with_compaction_disabled
compaction: don't rule out compacting sstables in validate-mode scrub
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* dump-index - dumps the content of the sstable index(es), similar to scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
the sstable(s);
* custom - a hackable operation for the expert user (until scripting
support is implemented);
* validate - validate the content of the sstable(s) with the mutation
fragment stream validator, same as scrub in validate mode;
The sstables to be examined are passed as positional command line
arguments. Sstables will be processed by the selected operation
one-by-one (can be changed with `--merge`). Any number of sstables can
be passed but mind the open file limits. Pass the full path to the data
component of the sstables (*-Data.db). For now it is required that the
sstable is found at a valid data path:
/path/to/datadir/{keyspace_name}/{table_name}-{table_id}/
The schema to read the sstables is read from a `schema.cql` file. This
should contain the keyspace and table definitions, as well as any UDTs
used.
Filtering the sstable(s) to process only certain partition(s) is supported
via the `--partition` and `--partitions-file` command line flags.
Partition keys are expected to be in the hexdump format used by scylla
(hex representation of the raw buffer).
Operations write their output to stdout, or file(s). The tool logs to
stderr, with a logger called `scylla-sstable-crawler`.
Examples:
# dump the content of the sstable
$ scylla-sstable-crawler --dump /path/to/md-123456-big-Data.db
# dump the content of the two sstable(s) as a unified stream
$ scylla-sstable-crawler --dump --merge /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db
# generate a joint histogram for the specified partition
$ scylla-sstable-crawler --writetime-histogram --partition={{myhexpartitionkey}} /path/to/md-123456-big-Data.db
# validate the specified sstables
$ scylla-sstable-crawler --validate /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db
Future plans:
* JSON output for dump.
* A simple way of generating `schema.cql` for any schema, other than copying it from snapshots, or copying from `cqlsh`. None of these generate a complete output.
* Relax sstable path checks, so sstables can be loaded from any path.
* Add scripting support (Lua), allowing custom operations to be written
in a scripting language.
Refs: #9241
Closes #9271
* github.com:scylladb/scylla:
tools: remove scylla-sstable-index
tools: introduce scylla-sstable
tools: extract finding selected operation (handler) into function
tools: add schema_loader
cql3: query_processor: add parse_statements()
cql3: statements/create_type: expose create_type()
cql3: statements/create_keyspace: add get_keyspace_metadata()
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them, and all the
reader, compacting and result-building code is happily ignorant of
the fact that it is a reversed stream.
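The bound swap performed by the reversing reader can be sketched as follows (all types here are made up for illustration, not Scylla's actual mutation fragment classes):

```cpp
#include <cassert>
#include <utility>

// Illustrative sketch: in the native reverse format a range
// tombstone's start/end bounds trade places, so that compared with
// the reversed clustering order the stream still looks like a
// forward one to all downstream consumers.
struct bound { int position; bool inclusive; };
struct range_tombstone_sketch { bound start; bound end; };

range_tombstone_sketch reverse_rt(range_tombstone_sketch rt) {
    std::swap(rt.start, rt.end); // bounds trade places under reversal
    return rt;
}
```

Reversing twice restores the original bounds, mirroring how the stream format round-trips.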
This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
reversed streams, this layer will create a reversed reader to handle
the transition.
* The mutation source proper: which will obtain the underlying schema
and will emit the data in reverse order.
Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.
Refs: #1413
Tests: unit(dev)
TODO:
* v2
* more testing
Also on: https://github.com/denesb/scylla.git reverse-reads/v3
Changelog
v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
in the schema itself, instead make reversing deterministic w.r.t.
schema version, such that:
`s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
`reconcilable_results` sent to the coordinator during read repair
still use the old reverse format;
v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
- `schema::underlying_type()`: return this when unengaged;
- `schema::make_transformed()`: noop when applying the same
transformation again;
* Generalize reversed into transformation. Add support to transferring
to remote nodes and shards by way of making `schema_tables` aware of
the transformation;
* Use reverse schema everywhere in reverse reader;
Closes #9184
* github.com:scylladb/scylla:
range_tombstone_accumulator: drop _reversed flag
test/boost/mutation_test: add test for mutation::consume() monotonicity
test/boost/flat_mutation_reader_test: more reversed reader tests
flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
flat_mutation_reader: make_reversing_reader(): take ownership of the reader
test/lib/mutation_source_test: add consistent log to all methods
mutation: introduce reverse()
mutation_rebuilder: make it standalone
mutation: make copy constructor compatible with mutation_opt
treewide: switch to native reversed format for reverse reads
mutation: consume(): add native reverse order
mutation: consume(): don't include dummy rows
query: add slice reversing functions
partition_slice_builder: add range mutating methods
partition_slice_builder: add constructor with slice
query: specific_ranges: add non-const ranges accessor
range_tombstone: add reverse()
clustering_bounds_comparator: add reverse_kind()
schema: introduce make_reversed()
schema: add a transforming copy constructor
utils: UUID_gen: introduce negate()
types: add reversed(data_type)
docs: design-notes: add reverse-reads.md
Check that the reverse reader emits a stream identical to that emitted
by a reader reading in native order from a table with reversed
clustering order.
Most test methods log their own name either via testlog.info() or
BOOST_TEST_MESSAGE() so failures can be more easily located. Not all do
however. This commit fixes this and also converts all those using
BOOST_TEST_MESSAGE() for this to testlog.info(), for consistency.
Currently `_data` is assumed to be engaged by the copy constructor which
is not necessarily the case with `mutation_opt` objects (which is an
`optimized_optional<mutation>`). Fix this by only copying `_data` if
non-null.
The existing consume_in_reverse::yes is renamed to
consume_in_reverse::legacy_half_reverse and consume_in_reverse::yes now
means native reverse order. This is because we expect the legacy order
to die out at some point, and when that happens we can just remove that
ugly third option and be left with yes and no as before.
Intended to be used to modify an existing slice. We want to move the
slice into the direction where the schema is at: make it completely
immutable, all mutations happening through the slice builder class.
Take write lock for cf to serialize cleanup/upgrade sstables/scrub
with major compaction/reshape/reshard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since we might potentially have ongoing compactions, and we
must ensure that all sstables created before we run are scrubbed,
we need to barrier out any previously running compactions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Even sstables being compacted must be validated; otherwise scrub
validate may return a false negative.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`make_reversed()` creates a schema identical to the schema instance it is
called on, with clustering order reversed. To distinguish the reverse
schema from the original one, the node-id part of its version UUID is
bit-flipped. This ensures that reversing a schema twice will result in
the identical schema to the original one (although a different C++
object).
This reversed schema will be used in reversed reads, so intermediate
layers can be ignorant of the fact that the read happens in reverse.
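The involution property (`s.version() == s.make_reversed().make_reversed().version()`) can be sketched with a toy version type; the types and the exact bit layout here are made up, not Scylla's actual UUID handling:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a schema version whose "node id" half is
// bit-flipped on reversal. Flipping is an involution, so reversing
// twice restores the original version exactly.
struct version_uuid {
    uint64_t msb; // time-and-version bits, left untouched
    uint64_t lsb; // "node id" part, bit-flipped on reversal
    bool operator==(const version_uuid& o) const {
        return msb == o.msb && lsb == o.lsb;
    }
};

version_uuid make_reversed_version(version_uuid v) {
    v.lsb = ~v.lsb; // flipping twice restores the original bits
    return v;
}
```

This keeps reversal deterministic with respect to schema version without storing any "is reversed" flag in the schema itself.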
In order to be able to avoid a deadlock when the CQL server cannot be
started, the view builder shutdown procedure is now split into two
parts: drain and stop. Drain is performed before storage proxy
shutdown, but stop() will be called even before drain is scheduled.
The deadlock is as follows:
- view builder creates a reader permit in order to be able
to read from system tables
- CQL server fails to start, shutdown procedure begins
- view builder stop() is not called (because it was not scheduled
yet), so it holds onto its reader permit
- database shutdown procedure waits for all permits to be destroyed,
and it hangs indefinitely because view builder keeps holding
its permit.
Fixes #9306
Closes #9308
* github.com:scylladb/scylla:
main: schedule view builder stopping earlier
db,view: split stopping view builder to drain+stop
The empty-range check causes more bugs than it fixes. Replace it with
an explicit check for =NULL (see #7852).
Fixes #9311.
Fixes #9290.
Tests: unit (dev), cql-pytest on Cassandra 4.0
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9314
There is typically just 1-2 digests per query, so we can allocate
space for them in digest_read_resolver using small_vector, saving
an allocation.
Results: (perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10)
before: median 215301.75 tps ( 75.1 allocs/op, 12.1 tasks/op, 45238 insns/op)
after: median 221121.37 tps ( 74.1 allocs/op, 12.1 tasks/op, 45186 insns/op)
While the throughput numbers are not reliable due to frequency throttling,
it's clear there are fewer allocations and instructions executed.
Closes #9296
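The small-vector idea can be sketched with a toy small-buffer container; this is illustrative only, Scylla's actual `utils::small_vector` is a full container with contiguous storage:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: the first N elements live inline, so the
// common case (1-2 digests per query) needs no heap allocation;
// only past N elements do we spill to a heap-allocated vector.
template <typename T, size_t N>
class tiny_small_vector {
    T _inline[N]{};
    size_t _size = 0;
    std::vector<T> _overflow; // used only past N elements
public:
    void push_back(const T& v) {
        if (_size < N) _inline[_size] = v;
        else _overflow.push_back(v);
        ++_size;
    }
    size_t size() const { return _size; }
    bool spilled() const { return _size > N; }
    const T& operator[](size_t i) const {
        return i < N ? _inline[i] : _overflow[i - N];
    }
};
```

With N sized for the typical digest count, the per-query allocation disappears, matching the allocs/op drop reported above.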
In order to avoid a deadlock described in the previous commit,
view builder stopping is registered earlier, so that its destructor
is called and its reader permit is released before the database starts
shutting down.
Note that draining the view builder is still scheduled later,
because it needs to happen before storage proxy drain
to keep the existing deinitialization order.
Fixes #9306
We previously forbade selecting a static column when an index is
used. But Cassandra allows it, so we should, too -- see #8869.
After removing the static-column check, the existing code gets the
correct result without any further changes (though it may read
multiple rows from the same partition).
Fixes #8869.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9307
In the quest to have explicit dependencies and the ability to run
multiple nodes in one process, remove some uses of get_gossiper() and
get_local_gossiper() and replace them with parameters passed from main()
or its equivalents.
Some uses still remain, mostly in snitch, but this series removes a
majority.
* https://github.com/avikivity/scylla.git gossiper-deglobal-1/v1
alternator: remove uses of get_local_gossiper()
storage_service: remove stray get_gossiper(), get_local_gossiper() calls
migration_manager: remove use of get_gossiper() from passive_announce()
storage_proxy: start_hints_manager(): don't require caller to provide gossiper
migration_manager: remove uses of get_local_gossiper()
storage_proxy: remove uses of get_local_gossiper()
gossiper: remove get_local_gossiper() from some inline helpers
gossiper: remove get_gossiper() from stop_gossiping()
gossiper: remove uses of get_local_gossiper for its rpc server
api: remove use of get_local_gossiper()
gossiper: remove calls to global get_gossiper from within the gossiper itself
migration_manager already has a reference to _gossiper, but
passive_announce is static and so can't use it. Luckily the only
caller (in storage_service) uses it as if it weren't static, so
we can just unstaticify it.
Pass gossiper as a constructor parameter instead. cql_test_env
gains a use of get_gossiper() instead, but at least these uses
are concentrated in one place.
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* index-dump - dumps the content of the sstable index(es), similar to
scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
the sstable(s);
* custom - a hackable operation for the expert user (until scripting
support is implemented);
* validate - validate the content of the sstable(s) with the mutation
fragment stream validator, same as scrub in validate mode;
Some state accessors called get_local_gossiper(); this is removed
and replaced with a parameter. Some callers (redis, alternators)
now have the gossiper passed as a parameter during initialization
so they can use the adjusted API.
Have the callers pass it instead, and they all have a reference
already except for cql_test_env (which will be fixed later).
The checks for initialization it does are likely unnecessary, but
we'll only be able to prove it when get_gossiper() is completely
removed.
Initialization happens in the gossiper itself, so we can capture
'this'. If we need to move to shard 0, use sharded::invoke_on() to
get the local instance.
Pass down gossiper from main, converting it to a shard-local instance
in calls to register_api() (which is the point that broadcasts the
endpoint registration across shards).
This helps remove gossiper as a global variable.
We want to use the pattern of having a command line flag for each
operation in more tools, so extract the logic which finds the selected
operation from the command line arguments into a function.
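The extracted logic can be sketched roughly like this; the function name and signature are invented for illustration, not the tool's actual code:

```cpp
#include <optional>
#include <set>
#include <string>

// Illustrative sketch: each operation has its own command-line flag,
// and exactly one must be present. Return the selected operation, or
// nullopt when zero or more than one operation flag was given.
std::optional<std::string> find_selected_operation(
        const std::set<std::string>& given_flags,
        const std::set<std::string>& operations) {
    std::optional<std::string> selected;
    for (const auto& op : operations) {
        if (given_flags.count(op)) {
            if (selected) return std::nullopt; // ambiguous: two ops given
            selected = op;
        }
    }
    return selected; // nullopt when no operation flag was given
}
```

Tools can then map the returned operation name to its handler, keeping argument parsing in one shared place.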
A utility which can load a schema from a schema.cql file. The file has
to contain all the "dependencies" of the table: keyspace, UDTs, etc.
This will be used by the scylla-sstable-crawler in the next patch.
i3en.xlarge is currently not getting tuned properly. A quick test using
Scylla AMI ( ami-07a31481e4394d346 ) reveals that the storage
capabilities under this instance are greatly reduced:
$ grep iops /etc/scylla.d/io_properties.yaml
read_iops: 257024
write_iops: 174080
This patch corrects this typo, in such a way that iotune now properly
tunes this instance type.
Closes #9298
"
This is a set of random patches trying to fix broken cmake build:
* `-fcoroutines` flag is now used only for GCC, but not for Clang
* `SCYLLA-VERSION-GEN` invocation is adjusted to work correctly with
out-of-source builds
* Auxiliary targets are adjusted to support out-of-source builds
* Removed extra source files and added missing ones to the scylla
target
Scylla still doesn't build successfully with CMake. But now,
at least, it passes the configuration step, which is a prerequisite
to loading the solution in IDEs.
"
* 'cmake_improvements' of github.com:ManManson/scylla:
cmake: fix out-of-source builds
cmake: don't use `-fcoroutines` for clang
cmake: update and sort source files and idl:s
The 2nd continuation of the previous patch fixes the places
with run_with_async() nested inside explicit seastar::async
calls.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The continuation of the previous patch for the cases when the
sstables::test_env::run_with_async sits lower in the stack than
the SEASTAR_THREAD_TEST_CASE. The patching is similar but also
requires some care about reference-captured variables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The SEASTAR_THREAD_TEST_CASE runs the provided lambda in async
context. The sstables::test_env::run_with_async does the same.
This (script-generated) patch makes all of the found cases be
SEASTAR_TEST_CASE and, respectively, return the async future
from the run_with_async().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
There's a landmine buried in range_tombstone's move constructor.
Whoever tries to use it risks grabbing the tombstone out of the
containing list, thus leaking it and potentially invalidating an
iterator pointing at it. There's a safer without_link move
constructor out there, but still.
To keep this place safe it's better to separate range_tombstone
from its linkage into any container. In particular, to keep the range
tombstones in a range_tombstone_list, here's an entry type that keeps
the tombstone _and_ the list hook (which is a boost set hook).
The approach resembles the rows_entry::deletable_row pair.
tests: unit(dev, debug, patch from #9207)
fixes: #9243
"
* 'br-range-tombstone-vs-entry' of https://github.com/xemul/scylla:
range_tombstone: Drop without-link constructor
range_tombstone: Drop move_assign()
range_tombstone: Move linkage into range_tombstone_entry
range_tombstone_list: Prepare to use range_tombstone_entry
range_tombstone, code: Add range_tombstone& getters
range_tombstone_list: Factor out tombstone construction
range_tombstone_list: Simplify (maybe) pop_front_and_lock()
range_tombstone_list: De-templatize pop_as<>
range_tombstone_list: Conceptualize erase_where()
range_tombstone(_list): Mark some bits noexcept
mutation: Use range_tombstone_list's iterators
mutation_partition: Shorten memory usage calculation
mutation_partition: Remove unused local variable
Scrub is supposed to not remove anything from the input, writing it
as-is while fixing any corruption it might have. It shouldn't make any
assumptions about the input. Additionally, data shadowed by a tombstone
might be in another, corrupted sstable, so expired tombstones should
not be purged, in order to prevent data resurrection.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210904165908.135044-1-raphaelsc@scylladb.com>
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects. Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.
This is the most direct fix, because range and slice are calculated in
one place for all modification statements. It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior. We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects. We add a TODO for rejecting such DELETEs later.
Fixes #7852.
Tests: unit (dev), cql-pytest against Cassandra 4.0
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9286
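The empty-range rejection can be illustrated with a toy interval intersection; this is not the actual restriction code, just a sketch of why `c > 0 AND c < 0` reduces to an empty range that the statement can reject up front:

```cpp
#include <algorithm>
#include <optional>

// Illustrative sketch: a clustering restriction reduced to a
// half-open interval [lo, hi). An empty intersection means the
// statement can never match anything (e.g. c > 0 AND c < 0, or a
// NULL key yielding an empty range) and should be rejected.
struct interval { int lo; int hi; }; // matches lo <= c < hi

std::optional<interval> intersect(interval a, interval b) {
    interval r{std::max(a.lo, b.lo), std::min(a.hi, b.hi)};
    if (r.lo >= r.hi) return std::nullopt; // impossible restriction
    return r;
}
```

Computing ranges in one place means every modification statement gets the same emptiness check for free.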
The thing was used to move a range tombstone without detaching it
from the containing list (well, intrusive set). Now when the linkage
is gone this facility is no longer needed (and actually no longer
used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper was in use by the move-assignment operator and by the
.swap() method. Since the operator is now equivalent to the helper,
the code can be merged and .swap() can be prettified.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's time to remove the boost set's hook from the range_tombstone
and keep the tombstone wrapped in another class when its location is
the range_tombstone_list.
Also the added previously .tombstone() getters and the _entry alias
can be removed -- all the code can work with the new class.
Two places in the code that made use of without_link{} move-constructor
are patched to get the range_tombstone part from the respective _entry
with the same result.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A continuation of the previous patch. The range_tombstone_list works
with the range_tombstone very actively; making every single line
that does so call .tombstone() seems excessive. Instead, declare
the range_tombstone_entry alias. When the entry appears for real,
the alias will go away and the range_tombstone_list will be switched
to the new entity all at once.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently all the code operates on the range_tombstone class,
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to be patched to get the range_tombstone from the _entry one.
This patch prepares the ground for that by introducing the
range_tombstone& tombstone() { return *this; }
getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.
Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just add a helper for constructing the managed range
tombstone object. This will also help further patch
have less duplicating hunks in it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method returns a pointer to the left-most range tombstone
and expects the caller to "dispose" of it. This is not very nice
because the caller thus needs to mess with the relevant deleter.
A nicer approach is the pop-like one (formerly pop_as<>-like),
which returns the range tombstone by value. This value
is move-constructed from the original object, which is disposed of
by the container itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
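The pop-by-value pattern can be sketched generically; this is not the actual range_tombstone_list API, just the shape of the approach:

```cpp
#include <list>
#include <string>
#include <utility>

// Illustrative sketch: instead of handing out a pointer the caller
// must dispose of with the right deleter, move the element out by
// value and let the container free the remains itself.
template <typename T>
T pop_front_value(std::list<T>& l) {
    T v = std::move(l.front()); // move-construct from the stored object
    l.pop_front();              // the container disposes of the husk
    return v;
}
```

The caller gets a plain value with normal ownership semantics and never touches the container's deleter.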
The method pops the range tombstone from the containing list
and transparently "converts" it into some other type. Nowadays
all callers of it need the range tombstone as-is, so the template
can be relaxed down to a plain call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The range_tombstone's .empty() and .operator bool are trivially such.
The swap()'s noexceptness comes from what it calls -- the without-link
move constructor (noexcept) and .move_assign(). The latter is noexcept
because it's already called from noexcept move-assign operator and
because it calls noexcept move operators of tombstones' fields. The
update_node() is noexcept for the same reason.
The range_tombstone_list's clear() is noexcept because both the set
clear and the disposer lambda are such.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
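The "noexcept because everything it calls is noexcept" reasoning above can be checked at compile time with type traits; the types here are illustrative stand-ins:

```cpp
#include <type_traits>
#include <utility>

// Illustrative sketch: an operation may safely be marked noexcept
// when the operations it is composed of are noexcept, which the
// standard type traits can verify at compile time.
struct tomb { long ts = 0; };              // trivially movable field
struct rt_sketch {
    tomb t;
    rt_sketch& operator=(rt_sketch&&) noexcept = default;
};
static_assert(std::is_nothrow_move_assignable_v<rt_sketch>,
              "move-assign is noexcept because the fields' moves are");
```

Adding such static_asserts next to the noexcept declarations keeps the annotations honest if a field later grows a throwing move.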
This gcc flag is not supported by clang. `-fcoroutines-ts` also
cannot be used, so just don't supply anything, similar to
what `configure.py` does.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The consume_clustering_fragments declares several auxiliary symbols
to work with rows' and range-tombstones' iterators. For the range
tombstones it relies on what container is declared inside the
range tombstone itself. Soon the container declaration will move
from range_tombstone class into a new entity and this place should
be prepared for that. A better place to get the iterator types from
is the range-tombstone container itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tests in this patch verify that null characters are valid characters
inside string and bytes (blob) attributes in Alternator. The tests
verify this for both key attributes and non-key attributes (since those
are serialized differently, it's important to check both cases).
The tests pass on both DynamoDB and Alternator - confirming that we
don't have a bug in this area.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824163442.186881-1-nyh@scylladb.com>
"
A special-purpose reader which doesn't use the index at all, designed to
be used in circumstances where the index is not reliable. The use-case
is scrub and validate which often have to work with corrupt indexes and
it is especially important that they don't further any existing
corruption.
Tests: unit(dev)
"
* 'crawling-sstable-reader/v2' of https://github.com/denesb/scylla:
compaction: scrub/validate: use the crawling sstable reader
sstables: wire in crawling reader
sstables: mx/reader: add crawling reader
sstables: kl/reader: add crawling reader
This patch adds tests for two undocumented (as far as I can tell) corner
cases of CQL's string types:
1. The types "text" and "varchar" are not just similar - they are in
fact exactly the same type.
2. All CQL string and blob types ("ascii", "text" or "varchar", "blob")
allow the null character as a valid character inside them. They are
*not* C strings that get terminated by the first null.
These tests pass on both Cassandra and Scylla, so did not expose any
bug, but having such tests is useful to understand these (so-far)
undocumented behaviors - so we can later document them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824225641.194146-1-nyh@scylladb.com>
Python 3.8 doesn't allow using built-in collection types
in type annotations (such as `list` or `dict`).
This feature is supported starting from Python 3.9.
Replace `list[str]` type annotation with an old-style
`List[str]`, which uses `List` from the `typing` module.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210901131436.35231-1-pa.solodovnikov@scylladb.com>
After aa7cdc0392, run_custom_job() propagates stop exception.
The problem is that we fail to handle stop exception in the procedure
which stops ongoing compactions, so the exception will be propagated
all the way to init, which causes scylla to abort.
To fix this, let's swallow stop_exception in stop_ongoing_compactions(),
which is correct because compactions are stopped by triggering that
exception if signalled to stop.
When reshard is stopped, scylla init will fail as follows instead:
ERROR 2021-08-16 20:13:13,770 [shard 0] init - Startup failed: std::runtime_error
(Exception while populating keyspace 'keyspace5' with column family 'standard1'
from file ...
Fixes #9158.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210816232434.78375-1-raphaelsc@scylladb.com>
lw_shared_ptr must not be copied on a foreign shard.
Copying the vector on shard 0 increases the reference count of
lw_shared_ptr<sstable> elements that were created on other shards,
as seen in https://github.com/scylladb/scylla/issues/9278.
Fixes #9278
DTest: migration_test.py:TestLoadAndStream_with_3_0_md.load_and_stream_increase_cluster_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210902084313.2003328-1-bhalevy@scylladb.com>
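Why the copy must stay shard-local can be sketched with a minimal non-atomic refcounted pointer; this is illustrative only, not Seastar's actual `lw_shared_ptr`:

```cpp
// Illustrative sketch: the reference count is a plain (non-atomic)
// integer, so incrementing it from a foreign shard/thread is a data
// race. Copies are only legal on the shard that owns the object;
// cross-shard access must go through ownership-preserving wrappers.
template <typename T>
class local_shared_ptr {
    struct block { T value; long refs; };
    block* _b = nullptr;
public:
    explicit local_shared_ptr(T v) : _b(new block{v, 1}) {}
    local_shared_ptr(const local_shared_ptr& o) : _b(o._b) {
        ++_b->refs; // non-atomic increment: owning shard only
    }
    local_shared_ptr& operator=(const local_shared_ptr&) = delete;
    ~local_shared_ptr() { if (_b && --_b->refs == 0) delete _b; }
    long use_count() const { return _b->refs; }
    const T& operator*() const { return _b->value; }
};
```

The non-atomic count is exactly what makes the pointer cheap on its home shard, and exactly what makes a copy on shard 0 of a pointer created elsewhere undefined behavior.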
The CQL grammar is obviously about cql3 and mostly about cql3
expressions, so add using namespace statements so we don't
have to specify it over and over again.
These statements are present in the headers, but only in the
cql_parser namespace, so it doesn't pollute other translation
units.
Closes #9255
compile_commands.json is a format of compilation database info for use
with several editors, such as VSCode (with official C++ extension) and
Vim (with youcompleteme). It can be generated with ninja:
```
ninja -t compdb > compile_commands.json
```
I propose this addition so that this file won't be committed by
accident.
Closes #9279
This change allows invoking the script in out-of-source
builds: `git log` now uses `-C` option with the directory
containing the script.
Also, the destination path can now be overridden by providing
`-o|--output-dir PATH` option. By default it's set to the
`build` directory relative to the script location.
Usage message is now shown, when '-h|--help' option is
specified.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210831120257.46920-1-pa.solodovnikov@scylladb.com>
Pass --pip-packages option to tools/python3/reloc/build_reloc.sh,
add scylla-driver to the relocatable python3, which is required for
fix_system_distributed_tables.py.
[avi: regenerate toolchain]
Ref #9040
Both functions do the same -- get datacenters from given
endpoint and local broadcast address and compare them to
match (or not).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210902080858.16364-1-xemul@scylladb.com>
* 'raft-misc-v3' of github.com:scylladb/scylla-dev:
raft: rename snapshot into snapshot_descriptor
raft: drop snapshot if its application failed
raft: fix local snapshot detection
raft: replication_test: store multiple snapshots in a state machine
raft: do not wait for entry to become stable before replicating it
Sstables that are scrubbed or validated are typically problematic ones
that potentially have corrupt indexes. To avoid using the index
altogether use the recently added crawling reader. Scrub and validate
never skips in the sstable anyway.
"
There are 4 places out there that do the same steps: parsing
"client_|server_encryption_options" and configuring the
seastar::tls::creds_builder with the values (messaging, redis,
alternator and transport).
Also to make redis and transport look slimmer main() cleans
the client_encryption_options by ... parsing it too.
This set introduces a (coroutinized) helper to configure the
creds_builder with map<string, string> and removes the options
beautification from main.
tests: unit(dev), dtest.internode_ssl_test(dev)
"
* 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla:
code: Generalize tls::credentials_builder configuration
transport, redis: Do not assume fixed encryption options
messaging: Move encryption options parsing to ms
main: Open-code internode encryption misconfig warning
main, config: Move options parsing helpers
The link to the docker build instructions was outdated - from the time
our docker build was based on a Redhat distribution. It no longer is,
it's now based on Ubuntu, and the link changed accordingly.
Fixes #9276.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210901083055.445438-1-nyh@scylladb.com>
...when using the Segment/TotalSegments option.
The requirement is not specified in the DynamoDB documentation, but found
in DynamoDB Local:
{"__type":"com.amazon.coral.validate#ValidationException",
"message":"Exclusive start key must lie within the segment"}
Fixes #9272
Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>
Closes #9270
A special-purpose reader which doesn't use the index at all and hence
doesn't support skipping at all. It is designed to be used in conditions
in which the index is not reliable (scrub compaction).
Restore the pre-6773563d3 behavior of demanding ALLOW FILTERING when a partition slice is requested on a potentially unlimited number of partitions. Put it behind a flag defaulting to "off" for now.
Fixes #7608; see comments there for justification.
Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9126
* github.com:scylladb/scylla:
cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
cql3: Track warnings in prepared_statement
test: Use ALLOW FILTERING more strictly
cql3: Add statement_restrictions::to_string
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING. Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.
Because we now reject queries that worked fine before, existing
applications can break. Therefore, the behavior is controlled by a
flag currently defaulting to off. We will default to "on" in the next
Scylla version.
Fixes #7608; see comments there for justification.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Introduce a general-purpose search and replace function to manipulate
expressions, and use it to simplify replace_column_def() and
replace_token().
Closes #9259
* github.com:scylladb/scylla:
cql3: expr: rewrite replace_token in terms of search_and_replace()
cql3: expr: rewrite replace_column_def in terms of search_and_replace()
cql3: expr: add general-purpose search-and-replace
"
When cloning throws in the middle, it may leak some child
nodes, triggering the respective assertion in the node destructor.
There's also a chance to mis-assert the linear-node roll-back.
tests: unit(dev)
"
Fixes #9248
Backport: 4.5
* 'br-btree-clone-exceptions-2' of https://github.com/xemul/scylla:
btree: Add comments in .clone() and .clear()
btree, test: Test exception safety and non-leakness of btree::clone_from
btree, test: Test key copy constructor may throw
btree: Dont leak kids on clone roll-back
btree: Destroy, not drop, node on clone roll-back
There are two tricky places concerning the management of the corner
leaf pointers. Add comments describing the magic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use search_and_replace() to simplify replace_token(). Note the conversion
does not have 100% fidelity - the previous implementation throws on
some impossible subexpression types, and the new one passes them
through. It should be the caller's responsibility anyway, not a
side effect of replacing tokens, and since these subexpressions are
impossible there is no real effect on execution.
Note that this affects only TOKEN() calls on the partition key
columns in the right order. Other uses of the token function
(say with constants) won't be translated to the token subexpression
type. So something like
WHERE token(pk) = token(?)
would only see the left-hand side replaced, not the right-hand
side, even if it were an expression rather than a term.
We won't introduce new expression types that are equivalent
to column_value, and search_and_replace() takes care of all
expressions that need to recurse, so we don't need std::visit()
for the search/replace lambda.
Add a recursive search-and-replace function on expressions. The
caller provides a search/replace function to operate on subexpressions,
returning nullopt if they want the default behavior of recursively
copying, or a new expression to terminate the search (in the current
subtree) and replace the current node with the returned expression.
To avoid endlessly specifying the subexpression types that get the
common behavior (copying) since they don't contain any subexpressions,
we add a new concept LeafExpression to signify them.
Existing functions such as replace_token() can be reimplemented in
terms of search_and_replace, but that is left for later.
When a failed-to-be-cloned node cleans itself up it must also clear
all its child nodes. Plain destroy() doesn't do that; it only
frees the provided node.
fixes: #9248
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The node in this place is not yet attached to its parent, so
in btree::debug::yes (tests-only) mode the node::drop()'s parent
checks will access a null parent pointer.
However, in non-testing runtime there's a chance that a linear
node fails to clone one of its keys and gets here. In this case
it will carry both leftmost and rightmost flags and the assertion
in drop will fire.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The (oddly-placed) document dist/docker/debian/README.md explains how a
developer can build a Scylla docker image using a self-built Scylla
executable.
While the document begins by saying that you can "build your own
Scylla in whatever build mode you prefer, e.g., dev.", the rest of the
instructions don't fit this example mode "dev" - the second command does
"ninja dist-deb" which builds *all* modes, while the third command
forgets to pass the mode at all (and therefore defaults to "release").
The fourth command doesn't work at all, and became irrelevant during
a recent rewrite in commit e96ff3d.
This patch modifies the document to fix those problems.
It ends with an example of how to run the resulting docker image
(this is usually the purpose of building a docker image - to run it
and test it). I did this example using podman because I couldn't get
it to work in docker. Later we can hopefully add the corresponding
docker example.
Fixes #9263.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829182608.355748-1-nyh@scylladb.com>
Our docker image accepts various command-line arguments and translates
them into Scylla arguments. For example, Alternator's getting-started
document has the following example:
```
docker run --name scylla -d -p 8000:8000 scylladb/scylla-nightly:latest
--alternator-port=8000 --alternator-write-isolation=always
```
Recently, this stopped working and the extra arguments at the end were
just ignored.
It turns out that this is a regression caused by commit
e96ff3d82d that changed our docker image
creation process from Dockerfile to buildah. While the entry point
specified in Dockerfile was a string, the same string in buildah has
a strange meaning (an entry point which can't take arguments) and to
get the original meaning, the entry point needs to be a JSON array.
This is kind-of explained in https://github.com/containers/buildah/issues/732.
So changing the entry point from a string to a JSON array fixes the
regression, and we can again pass arguments to Scylla's docker image.
Fixes #9247.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829180328.354109-1-nyh@scylladb.com>
We have in our Ninja build file various targets which ask to build just
a single build mode. For example, "ninja dev" builds everything in dev
mode - including Scylla, tests, and distribution artifacts - but
shouldn't build anything in other build modes (debug, release, etc.),
even if they were previously configured by configure.py.
However, we had a bug where these build-mode-specific targets
nevertheless compiled *all* configured modes, not just the requested
mode.
The bug was introduced in commit edd54a9463 -
targets "dist-server-compat" and "dist-unified-compat" were introduced,
but instead of having per-build-mode versions of these targets, only
one of each was introduced building all modes. When these new targets
were used in a couple of places in per-build-mode targets, it forced
these targets to build all modes instead of just the chosen one.
The solution is to split the dist-server-compat target into multiple
dist-server-compat-{mode}, and similarly split dist-unified-compat.
The unsplit target is also retained - for use in targets that really
want all build modes.
Fixes #9260.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829123418.290333-1-nyh@scylladb.com>
The code assumes that the snapshot that was taken locally is never
applied. Currently the logic to detect that is flawed: it relies on
the id of the most recently applied snapshot (where a locally taken
snapshot is considered to be applied as well). But if another local
snapshot is taken between snapshot creation and the check, the ids
will not match.
The patch fixes this by propagating locality information together with
the snapshot itself.
The state machine should be able to store more than one snapshot at a time
(one may be the currently used one and another is transferred from a
leader but not applied yet).
Since io_fiber persists entries before sending out messages, even
non-stable entries will become stable before being observed by other
nodes.
This patch also moves generation of append messages into the
get_output() call, because without the change we would lose batching,
since each advance of last_idx would generate a new append message.
This method has nothing to do with storage service and
is only needed to move feature service options from one
method to another. This can be done by its only caller.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210827133954.29535-1-xemul@scylladb.com>
The only case in this test effectively carries 6 of them. When run
as they are now (sequentially) the total test run time in debug mode
is ~35 minutes. When split each case takes ~6 minutes to complete.
In dev/release mode it's ~1 minute vs ~10 seconds respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers have been patched already. This argument can now
be used to replace get_local_storage_proxy().get_db().local()
call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.
That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:
1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
documentation in the future.
Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.
Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.
Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.
Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.
Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.
So, now the snapshots schema looks like this:
CREATE TABLE raft_snapshots (
    group_id timeuuid,
    server_id uuid,
    snapshot_id uuid,
    idx int,
    term int,
    -- no `config` field here, moved to `raft_config` table
    PRIMARY KEY ((group_id, server_id))
)
CREATE TABLE raft_config (
    group_id timeuuid,
    my_server_id uuid,
    server_id uuid,
    disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
    can_vote bool,
    ip_addr inet,
    PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
);
This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Before this commit, the service_level_controller would notify
the subscribers on stale deletes, meaning deletes of locally
non-existent service_levels.
The code flow shouldn't ever get to such a state, but as long
as this condition is checked instead of being asserted, it is
worthwhile to change the code to be safe.
Closes #9253
This series converts the `term::raw` objects the grammar produces to expressions.
For each grammar production, an expression type is either reused or created.
The term::raw methods are converted to free functions accepting expressions as
input (but not yet generating expressions as output).
There is some friction because the original code base had four different expression
domains: `term`, `term::raw`, `selectable`, and `selectable::raw`. They don't match
exactly, so in some cases we need to add additional state to distinguish between them.
There are also many run-time checks introduced (on_internal_error) since the union
of the domains is much larger than each individual domain.
The method used is to erect a bi-directional bridge between term::raw and expressions,
convert various term::raw subclasses one by one, and then remove the bridge and
term::raw.
Test: unit (dev).
Closes #9170
* github.com:scylladb/scylla:
cql3: expr: eliminate column_specification_or_tuple
cql3: expr: hide column_specification_or_tuple
cql3: term::raw: remove term::raw and scaffolding
cql3: grammar: collapse conversions between term::raw and expressions
cql3: relation: convert to_term() to expressions
cql3: functions: avoid intermediate conversions to term::raw
cql3: create_aggregate_statement: convert term::raw to expression
cql3: update_statement, insert_statement: convert term::raw to expression
cql3: select_statement: convert term::raw to expression
cql3: token_relation: convert term::raw to expressions
cql3: operation: convert term::raw to expression
cql3: multi_column_relation: convert term::raw to expressions
cql3: single_column_relation: convert term::raw to expressions
cql3: column_condition: convert term::raw to expressions
cql3: expr: don't convert subexpressions to term::raw during the prepare phase
cql3: attributes: convert to expressions
cql3: expr: introduce test_assignment_all()
cql3: expr: expose prepare_term, test_assignment in the expression domain
cql3: expr: provide a bridge between expressions and assignment_testable
cql3: expr, user types: convert user type literals to expressions
cql3: selection: make selectable.hh not include expr/expression.hh
cql3: sets, user types: move user types raw functions around
cql3: expr, sets, maps: convert set and map literals to collection_constructor
cql3: sets, maps, expr: move set and map raw functions around
cql3: expr, lists: convert lists::literal to new collection_constructor
cql3: lists, expr: move list raw functions around
cql3: tuples, expr: convert tuples::literal to expr::tuple_constructor
cql3: expr, tuples: deinline and move tuple raw functions
cql3: expr, constants: convert constants::literal to untyped_constant
cql3: constants: move constants::literal implementation around
cql3: expr, abstract_marker: convert to expressions
cql3: column_condition: relax types around abstact_marker::in_raw
cql3: tuple markers: deinline and rearrange
cql3: abstract_marker, term_expr: rearrange raw abstract marker implementation
cql3: expr, constants: convert cql3::constants::null_literal to new cql3::expr::null
cql3: expr, constants: deinline null_literal
cql3: constants: extricate cql3::constants::null_literal::null_value from null_literal
cql3: term::raw, expr: convert type casts to expressions
cql3: type_cast: deinline some methods
cql3: expr: prepare expr::cast for unprepared types
cql3: expr, functions: move raw function calls to expressions
cql3: expr, term::raw: add conversions between the two types
cql3: expr, term::raw: add reverse bridge
cql3: term::raw, expr: add bridge between term::raw and expressions
cql3: eliminate multi_column_raw
cql3: term::raw, multi_column_raw: unify prepare() signatures
Whenever a node_head_ptr is assigned to nil_root, the _backref inside it is
overwritten. But since nil_root is shared between shards, this causes severe
cache line bouncing. (It was observed to reduce the total write throughput
of Scylla by 90% on a large NUMA machine).
This backreference is never read anyway, so fix this bug by not writing it.
Fixes #9252
Closes #9246
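A minimal sketch of the fix (node_head, node_head_ptr and nil_root are illustrative stand-ins, not the real B-tree types): skip the backreference write when the target is the shared sentinel, since that backreference is never read:

```cpp
#include <cassert>

// Illustrative stand-ins for the real B-tree types.
struct node_head {
    node_head** _backref = nullptr;   // points back at the owning pointer
};

node_head nil_root;                   // shared, read-only sentinel

struct node_head_ptr {
    node_head* _ptr = nullptr;

    node_head_ptr& operator=(node_head* n) {
        _ptr = n;
        // Writing _backref into the shared nil_root dirties a cache line
        // that every shard touches, causing severe cache-line bouncing.
        // The sentinel's backreference is never read, so skip the write.
        if (n != &nil_root) {
            n->_backref = &_ptr;
        }
        return *this;
    }
};
```

Regular nodes still get their backreference maintained; only the shared sentinel assignment becomes a pure local write.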
column_specification_or_tuple is now used internally, wrapping
a column_specification or a vector and immediately unwrapping in
the callee. The only exceptions are bind_variable and tuple_constructor,
which handle both cases.
Use the underlying types directly instead, and add dispatching
to prepare_term_multi_column() for the two cases it handles.
column_specification_or_tuple was introduced since some terms
were prepared using a single receiver e.g. (receiver = <term>) and
some using multiple receivers (e.g. (r1, r2) = <term>). Some
term types supported both.
To hide this complexity, the term->expr conversion used a single
interface for both variations (column_specification_or_tuple), but now
that we got rid of the term class and there are no virtual functions
any more, we can just use two separate functions for the two variants.
Internally we still use column_specification_or_tuple; it can be
removed later.
The grammar now talks solely to expression APIs, so it can be converted
internally to expressions too. Calls to as_term_raw() and as_expression()
are removed, and productions return expressions instead of term::raw objects.
Change term::raw in multi_column_relation to expressions. Because a single
raw class is used to represent multiple shapes (IN ? and IN (x, y, z)),
some of the expressions are optional, corresponding to nullables before the
conversion.
to_term() is not converted, since it's part of the larger relation
hierarchy.
Change term::raw in single_column_relation to expressions. Because a single
raw class is used to represent multiple shapes (IN ? and IN (x, y, z)),
some of the expressions are optional, corresponding to nullables before the
conversion.
to_term() is not converted, since it's part of the larger relation
hierarchy.
Change term::raw in column_condition::raw to expressions. Because a single
raw class is used to represent multiple shapes (IN ? and IN (x, y, z)),
some of the expressions are optional, corresponding to nullables before the
conversion.
to_term() is not converted, since it's part of the larger relation
hierarchy.
Now that we have the prepare machinery exposed as expression APIs
(not just term::raw) we can avoid conversions from expressions to
term::raw when preparing subexpressions.
Convert the three variables in attributes::raw to expressions. Since
those attributes are optional, use std::optional to indicate it
(since we can't rely on shared_ptr<term::raw> being null).
The test_assignment class has a test_all() helper to test
a vector of assignment_testable. But expressions are not
derived from assignment_testable, so introduce a new helper
that does the same for expressions.
So far prepare (in the term domain) was called via term::raw. To be
able to prepare in the expression domain, expose functions prepare_term()
and test_assignment() that accept expressions as arguments.
prepare_term() was chosen rather than prepare() to differentiate wrt.
the other domain that can be prepared (selectables).
While we have a bridge between expressions and term::raw, which is
derived from assignment_testable, we will soon get rid of term::raw
and so won't be able to interface with APIs that require an
assignment_testable. So add a bridge for that. The user is
function::get(), which uses assignment_testable to infer the
function overload from the argument types.
Convert the user_types::literal raw to a new expression type
usertype_constructor. I used "usertype" to convey that it is a
((user type) constructor), not a (user (type constructor)).
We have this dependency now:
column_identifier -> selectable -> expression
and want to introduce this:
expression -> user types -> column_identifier
This leads to a loop, since expression is not (yet) forward
declarable.
Fix by moving any mention of expression from selectable.hh to a new
header selection-expr.hh.
database.cc lost access to timeout_config, so adjust its includes
to regain it.
Add set and map styles to collection_constructor. Maps are implemented as
collection_constructor{tuple_constructor{key, value}...}. This saves
having a new expression type, and reduces the effort to implement
recursive descent evaluation for this omitted expression type.
Move them closer to prepare related functions for modification. Since
sets and maps share some implementation details in the grammar, they
are moved and converted as a unit.
Introduce a collection_constructor (similar to C++'s std::initializer_list)
to hold subexpressions being gathered into a list. Since sets, maps, and
lists construction share some attributes (all elements must be of the
same type) collection_constructor will be used for all of them, so it
also holds an enum. I used "style" for the enum since it's a weak
attribute - an empty set is also an empty map. I chose collection_constructor
rather than plain 'collection' to highlight that it's not the only way
to get a collection (selecting a collection column is another, as an
example) and to hint at what it does - construct a collection from
more primitive elements.
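A simplified sketch of the shapes described in the last two commits (illustrative stand-ins, not the real cql3::expr definitions), including the map-as-tuples representation:

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Illustrative stand-ins; the real cql3::expr types carry more state
// (receivers, prepared types, ...).
struct untyped_constant { std::string raw; };
struct tuple_constructor;
struct collection_constructor;
using expression =
    std::variant<untyped_constant, tuple_constructor, collection_constructor>;
struct tuple_constructor { std::vector<expression> elements; };

struct collection_constructor {
    // A "style" rather than a hard type: an empty set is also an empty map.
    enum class style_type { list, set, map };
    style_type style;
    std::vector<expression> elements;
};

// {k1: v1} is represented as a map-style constructor whose single element
// is a two-element tuple_constructor {k1, v1} -- no separate map node.
collection_constructor one_entry_map() {
    return collection_constructor{
        collection_constructor::style_type::map,
        { tuple_constructor{{ untyped_constant{"k1"}, untyped_constant{"v1"} }} },
    };
}
```

This keeps recursive-descent evaluation uniform: one constructor node covers lists, sets, and maps.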
Introduce tuple_constructor (not a literal, since (?, ?) and (column_value,
column_value) are not literals) to represent a tuple constructed from
subexpressions. In the future we can replace column_value_tuple
with tuple_constructor(column_value, column_value, ...), but this is
not done now.
I chose the name 'tuple_constructor' since other expressions can represent
tuples (e.g. my_tuple_column, :bind_variable_of_tuple_type,
func_returning_tuple()). It also explains what the expression does.
Introduce a new expression untyped_constant that corresponds to
constants::literal, which is removed. untyped_constant is rather
ugly in that it won't exist post-prepare. We should probably instead
replace it with typed constants that use the widest possible type
(decimal and varint), and select a narrower type during the prepare
phase when we perform type inference. The conversion itself is
straightforward.
Convert the four forms of abstract_marker to expr::bind_variable (the
name was chosen since variable is the role of the thing, while "marker"
refers more to the grammar). Having four variants is unnecessary, but
this patch doesn't do anything about that.
We can only convert expressions to term::raw, not the subclass
abstract_marker::in_raw, so relax the types. They will all be converted
to expressions. Relaxing types isn't good, but the structure is enforced
now by the grammar (and dynamically using variant casts), and in the future
by a typecheck pass (which will allow us to remove the many variations
of markers).
null_literal (which is in the term::raw domain) will be converted to an
expression, so unnest the nested class null_value (which is in the term
domain and is not converted now).
We reuse the expr::cast type that was previously used for selectables.
When preparing, subexpressions are converted to term::raw; this will
be removed later.
These methods will be converted to the expression variant, and
it's impossible to do this while inlined due to #include cycles. In
any case, deinlining is better.
Since there is no type_cast.cc, and since they'll become part of
expr_term call chain soon, they're moved there, even though it seems
odd for this patch. It's a waste to create type_cast.cc just for those
three functions.
The cast expression has two operands: the subexpression to cast and the
type to cast to. Since prepared and unprepared expressions are the
same type, we don't have to do anything, but prepared and unprepared
types are different. So add a variant to be able to support both.
The reason the selectable->expression transformation did not need to
do this is that casts in a selector cannot accept a user defined type.
Note those casts also have different syntax and different execution,
so we'll have to choose whether to unify the two semantics, or whether
to keep them separate. This patch does not force anything (but does hint
at unification by not including any discriminant beyond the type's
rawness). The string representation matches the part of the grammar
it was derived from (or conversion back to CQL will yield wrong
results).
Remove cql3::functions::function_call::raw and replace it with
cql3::expr::function_call, which already existed from the selector
migration to expressions. The virtual functions implementing term::raw
are made free functions and remain in place, to ease migration and
review.
Note that preparing becomes more complicated as it needs to
account for anonymous functions, which were not representable
in the previous structure (and still cannot be created by the
parser for the term::raw path).
The parser now wraps all its arguments with the term::raw->expr
bridge, since that's what expr::function_call expects, and in
turn wraps the function call with an expr->term::raw bridge, since
that's what the rest of the parser expects. These will disappear
when the migration completes.
Since expressions can nest, and since we won't convert everything at once,
add a way to store a term::raw as an expression. We can now have a
term::raw that is internally an expression, and an expression that is
implemented as term::raw.
A term_raw_expression is a term::raw that holds an expression. It will
be used to incrementally convert the source base to expressions, while
still exposing the result to the common interface of shared_ptr<term::raw>.
Now that the signatures of term::raw::prepare and multi_column_raw::prepare
are identical, we can eliminate multi_column_raw, replacing it with
term::raw where needed. In some cases we delete it from the inheritance chain
since we reach term::raw via a different base class.
Note that a dynamic_cast<> is eliminated, so we compensate for the addition
of runtime checks in the previous patch by the deletion of runtime checks
in this patch.
In order to replace the term::raw hierarchy with expressions,
we need to unify the signatures of term::raw::prepare() and
term::multi_column_raw::prepare(). This is because we'll only have
one expression type to represent both single values and tuples
(although different subexpression types may be used).
The difference in the two prepare() signatures is the
`receiver` parameter - which is a (type, name) pair used
to perform type inference on the expression being prepared,
with the name used to report errors. In a perfect world, this
would just be an expression - a tuple or a singular expression
as the case requires. But we don't have the needed expression
infrastructure yet - general tuples or name-annotated expressions.
Resolve the problem by introducing a variant for the single-value
and tuple. This is more or less creating a mini-expression type
used just for this. Once our expression type grows the needed
capabilities, it can replace this type.
Note that for some cases, this replaces compile-time checks by
runtime checks (which should never trigger). In other cases
the classes really needed both interfaces, so the new variant
is a better fit.
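The variant idea can be sketched under simplified assumptions (all names below are illustrative):

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Illustrative stand-in for cql3's (type, name) receiver pair.
struct column_specification { std::string name; };

// The mini-expression described above: either a single receiver
// (receiver = <term>) or a tuple of receivers ((r1, r2) = <term>).
using receiver_spec =
    std::variant<column_specification, std::vector<column_specification>>;

// Dispatch on the shape at run time; the real code raises
// on_internal_error when a term is handed a shape it cannot accept.
std::string describe(const receiver_spec& r) {
    return std::visit([] (const auto& v) -> std::string {
        if constexpr (std::is_same_v<std::decay_t<decltype(v)>,
                                     column_specification>) {
            return "single:" + v.name;
        } else {
            return "tuple of " + std::to_string(v.size());
        }
    }, r);
}
```

A unified prepare() signature then takes one receiver_spec parameter instead of two overloads.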
This series incorporates various refactorings aimed mostly
at eliminating extra parameters to `serializer_*_impl` functions
for `EnumDef` and `ClassDef` AST classes.
Instead of carrying these parameters here and there over many
places, they are calculated on a preliminary run to collect
additional metadata, such as: namespaces and template parameters
from parent scopes. This metadata is used later to extend AST
classes.
The patchset does not introduce any changes in the generation
procedures, exclusively dealing with internal code structuring.
NOTE: although metadata collection involves an extra run through
the parse tree, the proper way should be to populate it instantly
while parsing the input. This is left to be adjusted later in a
follow-up series.
Closes #8148
* github.com:scylladb/scylla:
idl: add descriptions for the top-level generation routines
idl: make ns_qualified name a class method
idl: cache template declarations inside enums and classes
idl: cache parent template params for enums and classes
idl: rename misleading `local_types` to `local_writable_types`
idl: remove remaining uses of `namespaces` argument
idl: remove `is_final` function and use `.final` AST class property
idl: remove `parent_template_param` from `local_types` set
idl: cache namespaces in AST nodes
idl: remove unused variables
Preparation should be able to record warnings that make it back to the
user via the query response.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
This series moves the timeout parameter, that is passed to most
f_m_r methods, into the reader_permit. This eliminates
the need to pass the timeout around, as it's taken
from the permit when needed.
The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.
Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.
$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10
Before:
102500.38 tps ( 75.1 allocs/op, 12.1 tasks/op, 45620 insns/op)
After:
103957.53 tps ( 75.1 allocs/op, 12.1 tasks/op, 45372 insns/op)
Test: unit(dev)
DTest:
repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"
* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
flat_mutation_reader: get rid of timeout parameter
reader_concurrency_semaphore: use permit timeout for admission
reader_concurrency_semaphore: adjust reactivated reader timeout
multishard_mutation_query: create_reader: validate saved reader permit
repair: row_level: read_mutation_fragment: set reader timeout
flat_mutation_reader: maybe_timed_out: use permit timeout
test: sstable_datafile_test: add sstable_reader_with_timeout
reader_permit: add timeout member
With data segregation on repair, thousands of sstables are potentially
added to the maintenance set, which causes high latency due to stalls.
That's because N*M sstables are created by a repair,
where N = # of ranges
and M = # of segregations.
For TWCS, M = # of windows.
Assuming N = 768 and M = 20, ~15k sstables end up in the sstable set.
To fix this problem, let's avoid performing data segregation in repair,
as offstrategy will already perform the segregation anyway.
So from now on, only N non-overlapping sstables will be added to the set.
Read amplification isn't affected because a query will only touch one
sstable in the maintenance set.
When offstrategy starts, it will pick all sstables from the set and
compact them in a single step while performing data segregation,
so data is properly laid out before being integrated into the main set.
tests:
- sstable_compaction_test.twcs_reshape_with_disjoint_set_test
- mode(dev)
- manual test using repair-based bootstrap
Fixes #9199.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210824185043.76475-1-raphaelsc@scylladb.com>
The current code std::move()-s the range tombstone into the consumer,
thus moving the tombstone's linkage to the containing list as well. As
a result the original range tombstone itself leaks, as it leaves
the tree and cannot be reached on .clear(). Another danger is that
the iterator pointing to the tombstone becomes invalid while it's
then ++-ed to advance to the next entry.
The immediate fix is to keep the tombstone linked to the list while
moving.
fixes: #9207
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210825100834.3216-1-xemul@scylladb.com>
"
This series implements section 6.4 of the Raft PhD. It allows doing
linearisable reads on a follower, bypassing the raft log entirely. After this
series server::read_barrier can be executed on a follower as well as on a
leader, and after it completes the local user state machine's state can be
accessed directly.
"
* 'raft-read-v9' of github.com:scylladb/scylla-dev:
raft: test: add read_barrier test to replication_test
raft: test: add read_barrier tests to fsm_test
raft: make read_barrier work on a follower as well as on a leader
raft: add a function to wait for an index to be applied
raft: (server) add a helper to wait through uncertainty period
raft: make fsm::current_leader() public
raft: add hasher for raft::internal::tagged_uint64
serialize: add serialized for std::monostate
raft: fix indentation in applier_fiber
This patch implements a Raft extension that allows performing linearisable
reads by accessing the local state machine. The extension is described
in section 6.4 of the PhD. To sum it up: to perform a read barrier on
a follower, it needs to ask the leader for the last committed index that
it knows about. The leader must make sure that it is still the leader
before answering, by communicating with a quorum. When the follower gets
the index back it waits for it to be applied, and by that completes the
read_barrier invocation.
The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask the leader about the safe index it can read. The first two are used
by a leader to communicate with a quorum.
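The protocol can be sketched as a toy, single-threaded model (all names and the synchronous "wait" loop are illustrative; the real implementation is asynchronous and goes over RPC):

```cpp
#include <cassert>

// Toy model of the follower-side read barrier from Raft PhD section 6.4.
struct leader_state {
    unsigned long commit_idx = 0;
    bool quorum_reachable = true;   // stands in for a heartbeat round
};

struct follower_state {
    unsigned long applied_idx = 0;
};

// Leader side: confirm leadership with a quorum first, then hand out
// the last committed index it knows about.
bool leader_read_barrier(const leader_state& l, unsigned long& out_idx) {
    if (!l.quorum_reachable) {
        return false;               // leadership could not be confirmed
    }
    out_idx = l.commit_idx;
    return true;
}

// Follower side: ask the leader for the safe index, then wait until the
// local state machine has applied up to it; local reads are then
// linearisable.
bool follower_read_barrier(const leader_state& l, follower_state& f) {
    unsigned long idx = 0;
    if (!leader_read_barrier(l, idx)) {
        return false;
    }
    while (f.applied_idx < idx) {
        ++f.applied_idx;            // stands in for "wait for the apply fiber"
    }
    return true;
}
```

The key invariant: the follower never serves the read until its applied index has caught up with a commit index vouched for by a confirmed leader.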
Add a helper to be able to wait until a Raft cluster
leader is elected. It can be used to avoid sleeps
when it's necessary to forward a request to the leader,
but the leader is not yet known.
"
Factor out replication test, make it work with different clocks, add
some features, and add a many nodes test with steady_clock. Also
refactor common test helper.
The many nodes test passes for release and dev with a normal tick of 100ms
for up to 1000 servers. Debug mode is much slower due to lack of
optimizations, so it's only tested with smaller numbers.
Tests: unit ({dev}), unit ({debug}), unit ({release})
"
* 'raft-many-22-v12' of https://github.com/alecco/scylla: (21 commits)
raft: candidate timeout proportional to cluster size
raft: testing: many nodes test
raft: replication test: remove unused tick_all
raft: replication test: delays
raft: replication test: packet drop rpc helper
raft: replication test: connectivity configuration
raft: replication test: rpc network map in raft_cluster
raft: replication test: use minimum granularity
raft: replication test: minor: rename local to int ids
raft: replication test: fix restart_tickers when partitioning
raft: replication test: partition ranges
raft: replication test: isolate one server
raft: replication test: move objects out of header
raft: replication test: make dummy command const
raft: replication test: template clock type
raft: replication test: tick delta inside raft_cluster
raft: replication test: style - member initializer
raft: replication test: move common code out
raft: testing: refactor helper
raft: log election stages
...
Now that the timeout is stored in the reader
permit, use it for admission rather than a timeout
parameter.
Note that evictable_reader::next_partition
currently passes db::no_timeout to
resume_or_create_reader, which is propagated to
maybe_wait_readmission; this seems to be
an oversight of the f_m_r API, which doesn't
pass a timeout to next_partition().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The timeout needs to be propagated to the reader's permit.
Reset it to db::no_timeout in repair_reader::pause().
Warn if set_timeout asks to change the timeout too far (100ms) into the
past. It is possible that it will be passed a
past timeout from the rpc path, where the message timeout
is applied (as a duration) over the local lowres_clock time,
and parallel read_data messages that share the query may end
up having close, but different, timeout values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To avoid dueling candidates with large clusters, make the timeout
proportional to the cluster size.
Debug mode is too slow for a test of 1000 nodes so it's disabled, but
the test passes for release and dev modes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Tests with many nodes and realistic timers and ticks.
Network delays are kept as a fraction of ticks. (e.g. 20/100)
Tests with 600 or more nodes hang in debug mode.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Allow test supplied delays for rpc communication.
Allow supplying network delay, local delay (nodes within the same
server), how many nodes are local, and an extra small delay simulating
local load.
Modify rpc class to support delays. If delays are enabled, it no longer
directly calls the other node's server code but it schedules it to be
called later. This makes the test more realistic as in the previous
version the first candidate was always going to get to all followers
first, preventing a dueling candidates scenario.
Previously, tickers were all scheduled at the same time, so there was no
spread of them across the tick time. Now these tickers are scheduled
with a uniform spread across this time (tick delta).
Also, previously, custom free elections used tick_all(), which
traversed _in_configuration sequentially and ticked each server. This, combined
with rpc outbound directly calling methods in the other server without
yielding, made free elections unrealistic, with the order always the same
and the first candidate always winning. This patch changes this
behavior. The free election uses normal tickers (now uniformly
distributed across the tick delta), and its loop waits for the tick delta time
(yielding) and checks if there's a new leader. Also note the order might
not be the same in debug mode if more than one tick is scheduled.
As rpc messages are sent delayed, network connectivity needs to be
checked again before calling the function on the remote side.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When partitioning, elect_new_leader restarts tickers, so don't
re-restart them in this case.
When leader is dropped and no new leader is specified, restart tickers
before free election.
If no change of leader, restart tickers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add a --dist-only option to disable compiling code and just build packages.
It significantly speeds up rebuilding packages, making packaging
development easier.
Closes #9227
Due to an overzealous assertion put in the code (in one of the last iterations, by the way!) it was impossible to create an aggregate which accepts multiple arguments. The behavior is now fixed, and a test case is provided for it.
Tests: unit(release)
Closes #9211
* github.com:scylladb/scylla:
cql-pytest: add test case for UDA with multiple args
cql3: fix aggregates with > 1 argument
Introduce `raft` experimental option.
Adjust the tests accordingly to accommodate the new option.
It's not enabled by default when providing the
`--experimental=true` config option and should be
requested explicitly via the
`--experimental-options=raft` config option.
Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.
Later, other raft-related things, such as raft schema
changes, will also use this flag.
Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.
This will be done in a separate series, probably related to
implementing the feature itself.
Tests: unit(dev)
Ref #9239.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
Simple coroutinization of the schema load functions, leaving the code tidier.
Test: unit (dev)
Closes #9217
* github.com:scylladb/scylla:
database: adjust indentation after coroutinization of schema table parsing code
database: convert database::parse_schema_tables() to a coroutine
database: remove unneeded temporary in do_parse_schema_tables()
database: convert do_parse_schema_tables() to a coroutine
Merged patch series by Benny Halevy:
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
Test: unit(dev, debug)
* tag 'deferred_action-noexcept-v1' of github.com:bhalevy/scylla:
everywhere: make deferred actions noexcept
cql3: prepare_context: mark methods noexcept
commitlog: segment, segment_manager: mark methods noexcept
everywhere: cleanup defer.hh includes
Commit 8d6e575 introduced a new stat, instructions per fragment.
Computing this new stat could end with a division by zero when
the number of fragments read is 0. Here we fix it by reporting
0 ins/f when no fragments were read.
Fixes #9231
Closes #9232
Use a forward declaration of cql3::expr::oper_t to reduce the
number of translation units depending on expression.hh.
Before:
$ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
272
After:
$ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
154
Some translation units adjust their includes to restore access
to required headers.
Closes #9229
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
Test: unit(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The loading_shared_values/loading_cache iterator interface is dangerous and fragile:
an iterator doesn't "lock" the entry it points to, so if there is a
preemption point between acquiring a non-end() iterator and
dereferencing it, the corresponding cache entry may have already been evicted (for
whatever reason, e.g. cache size constraints or expiration), and then
dereferencing may end up in a use-after-free - and we don't have any
protection against that in the value_extractor_fn today.
And this is in addition to #8920.
So, instead of trying to fix the iterator interface, this patch kills two
birds with one stone: we ditch the iterator interface
completely and return value_ptr from find(...) instead - the same one we
return from the loading_cache::get_ptr(...) asynchronous APIs.
A similar rework is done to loading_shared_values, which loading_cache is
based on: we drop its iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.
loading_cache::value_ptr already takes care of "locking" the returned value so that it
remains readable even if it's evicted from the cache by the time
one tries to read it. And of course it also takes care of updating the
last-read timestamp and moving the corresponding item to the top of the
MRU list.
Fixes#8920
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
This is a side effect of allowing scylla_io_setup to run in nonroot mode:
the script can run as a non-root user even when the installation is not
a nonroot one.
As a result, the script eventually fails to write io_properties.yaml
with a permission-denied error. Since the evaluation takes a long time, we
should run the permission check before starting it.
We need to add the root privilege check again, but skip it in nonroot mode.
Fixes #8915
Closes #8984
The parse() function of high-level sstable metadata types are
trivial straight line code and can be easily simplified by
conversion to coroutines.
Test: unit (dev)
Closes #9224
* github.com:scylladb/scylla:
sstables: parse(*): adjust indentation after coroutine conversion
sstables: parse(compression&): eliminate unnecessary indirection
sstables: convert parse(compression&) to a coroutine
sstables: convert parse(commitlog_interval&) to a coroutine
sstables: parse(streaming_histogram&): eliminate unnecessary indirection
sstables: convert parse(streaming_histogram&) to a coroutine
sstables: convert parse(estimated_histogran&) to a coroutine
sstables: convert parse(statistics&) to a coroutine
sstables: convert parse(summary&) to a coroutine
All the places in code that configure the mentioned creds builder
from client_|server_encryption_options now do it the same way.
This patch generalizes it all in the utils:: helper.
The alternator code "ignores" require_client_auth and truststore
keys, but it's easy to make the generalized helper be compatible.
Also make the new helper coroutinized from the beginning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On start main() brushes up the client_encryption_options option
so that any user of it sees it in some "clean" state and can
avoid using get_or_default() to parse.
This patch removes this assumption (and the cleaning code itself).
Next patch will make use of it and relax the duplicated parsing
complexity back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Main collects a bunch of local variables from config and passes
them as arguments to messaging service initialization helper.
This patch replaces all these args with const config reference.
The motivation is to facilitate next patching by providing the
server encryption options k:v set right in the m.s. init code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a warning message printed when internode encryption is
set up "incorrectly". The incorrectness "if" uses local variables
that soon will be moved away. This patch makes the check rely
purely on the config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_or_default and is_true are two aux bits that are used
to parse the config options. The former is duplicated in the
alternator code as well.
Put both in utils namespace for future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Workaround for Clang bug: https://bugs.llvm.org/show_bug.cgi?id=51515
When compiled on aarch64 with ASAN support and -Og/-Oz/-Os optimization
level, `raft_sys_table_storage::do_store_log_entries` crashes during the
tests. ASAN incorrectly reports `stack-use-after-return` on
`std::vector` list initialization after initial coroutine suspension
(initializer list's data pointer starts to point to garbage).
The workaround is simple: don't use initializer lists in such case
and replace with a series of `emplace_back` calls.
Tests: unit(debug, aarch64)
Fixes#9178
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210818102038.92509-1-pa.solodovnikov@scylladb.com>
We need to pass the --supervisor option just for the scylla-server module,
and also pass the --packaging option to the scylla-jmx module to avoid running
the systemctl command, since the installation may run in a container, and the
container may not have systemd.
Fixes #9141
Closes #9142
Make the code tidier.
The conversion is not mechanical: the finally block is converted
to straight line code. stop()/close() must not fail anyway, and we
cannot recover from such failures. The when_all_succeed() for stopping
the semaphores is also converted to straight-line code - there is no
advantage to stopping them in parallel, as we're just waiting for
running tasks to complete and clean up.
Test: unit (dev)
Closes #9218
"
The mutation_fragments hashing code sitting in row-level repair
upsets clang and makes it spend 20 minutes compiling itself. This
set speeds this up greatly by moving the hashing code into the
mutation_fragment.cc and turning it into the appending_hash<>
specialisation. A simple sanity checking test makes sure this
doesn't change resulting hash values.
tests: unit.hashers_test(dev, release) // hash values matched, phew
dtest.repair_additional_test.repair_large_partition_existing_rows_test(release)
"
* 'br-row-level-comp-speedup-2.2' of https://github.com/xemul/scylla:
mutation_fragment: Specialize appending_hash for it
tests: Add sanity check for hashing mutation_fragments
Row-level repair hashes the mutation fragment and wraps this into a
private fragment_hasher class. For some reason it takes ~20 minutes
for clang to compile the row_level.o with -O3 level (release mode).
Putting the whole fragment_hasher into a dedicated file reduces the
compilation times ~9 times.
However, it seems more natural not to move the fragment_hasher around
but to specialize the appending_hash<> for mutation_fragment and make
row_level.cc code just call feed_hash().
Compilation times (release mode):
before after
row_level.o 19m34s 2m4s
mutation_fragment.o 13s 17s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patch is going to change the way row-level repair code hashes
mutation_fragment objects. This patch prepares a sanity check so that
the hash values are not accidentally changed, by hashing some simple
fragments and comparing them against known expected values.
The hash_mutation_fragment_for_test helper is added for this patch
only and will be removed really soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Documentation was extracted from abstract_replication_strategy::get_ranges(),
which says:
// get_ranges() returns the list of ranges held by the given endpoint.
// The list is sorted, and its elements are non overlapping and non wrap-around.
That's important because users of get_keyspace_local_ranges() expect
that the returned list is both sorted and non overlapping, so let's
document it to prevent someone from removing any of these properties.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210805140628.537368-1-raphaelsc@scylladb.com>
Currently, the view will not be updated because the streaming reason is set
to streaming::stream_reason::rebuild. On the receiver side, only
streaming with the reason streaming::stream_reason::repair will trigger
a view update.
Change the stream reason to repair to trigger view updates for load and
stream. This makes load_and_stream behave the same as nodetool refresh.
Note, however, that this is not very efficient.
Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is
loaded, it streams to 3 replica nodes, if we generate view updates, we
will have 3 view updates for this replica (each of the peer nodes finds
its peer and writes the view update to peer). After loading sst2 and
sst3, we will have 9 view updates in total for a single partition.
If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.
Fixes #9205
Closes #9213
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.
We implement a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
operation type - `either_of` - which is a "union" of the original
operation types. Executing `either_of` operation means executing an
operation of one of the original types, but the specific type
can be chosen in runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
new generator which spreads the operations from `g` roughly every `d`
ticks. (The actual definition in code is more general and complex but
the idea is similar.)
Some of these patterns have corresponding notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).
Finally, we implement a test that uses this new infrastructure.
Two `Executable` operations are implemented:
- `raft_call` is for calling to a Raft cluster with a given state
machine command,
- `network_majority_grudge` partitions the network in half,
putting the leader in the minority.
We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).
For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.
* kbr/raft-test-generator-v4:
test: raft: randomized_nemesis_test: a basic generator test
test: raft: generator: a library of basic generators
test: raft: introduce generators
test: raft: introduce `future_set`
test: raft: randomized_nemesis_test: handle `raft::stopped_error` in timeout futures
The previous commits introduced the basic generator concept and a
library of most common composition patterns.
In this commit we implement a test that uses this new infrastructure.
Two `Executable` operations are implemented:
- `raft_call` is for calling to a Raft cluster with a given state
machine command,
- `network_majority_grudge` partitions the network in half,
putting the leader in the minority.
We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).
For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.
This commit introduces a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
operation type - `either_of` - which is a "union" of the original
operation types. Executing `either_of` operation means executing an
operation of one of the original types, but the specific type
can be chosen in runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
new generator which spreads the operations from `g` roughly every `d`
ticks. (The actual definition in code is more general and complex but
the idea is similar.)
And so on.
Some of these patterns have corresponding notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).
We introduce the concepts of "operations" and "generators", basic
building blocks that will allow us to declaratively write randomized
tests for torturing simulated Raft clusters.
An "operation" is a data structure representing a computation which
may cause side effects such as calling a Raft cluster or partitioning
the network, represented in the code with the `Executable` concept.
It has an `execute` function performing the computation and returns
a result of type `result_type`. Different computations of the same type
share state of type `state_type`. The state can, for example, contain
database handles.
Each execution is performed on an abstract `thread` (represented by a `thread_id`)
and has a logical starting time point. The thread and start point together form
the execution's `context` which is passed as a reference to `execute`.
Two operations may be called in parallel only if they are on different threads.
A generator, represented through the `Generator` concept, produces a
sequence of operations. An operation can be fetched from a generator
using the `op` function, which also returns the next state of the
generator (generators are purely functional data structures).
The generator concept is inspired by the generators in the Jepsen
testing library for distributed systems.
We also implement `interpreter` which "interprets", or "runs", a given
generator, by fetching operations from the generator and executing them
with concurrency controlled by the abstract threads.
The algorithm used in the interpreter is also similar to the interpreter
algorithm in Jepsen, although there are differences. Most notably we don't
have a "worker" concept - everything runs on a single shard; but we use
"abstract threads" combined with futures for concurrency.
There is also no notion of "process". Finally, the interpreter doesn't
keep an explicit history, but instead uses a callback `Recorder` to notify
the user about operation invocations and completions. The user can
decide to save these events in a history, or perhaps they can analyze
them on the fly using constant memory.
A set of futures that can be polled.
Polling the set (`poll` function) returns the value of one of
the futures which became available or `std::nullopt` if the given
logical duration passes (according to the given timer), whichever
event happens first. The current implementation assumes sequential
polling.
New futures can be added to the set with `add`.
All futures can be removed from the set with `release`.
The timeout futures in `call` and `reconfigure` may be discarded after
Raft servers were `abort()`ed which would result in
`raft::stopped_error` and the test complained about discarded
exceptional futures. Discard these errors explicitly.
The supervisor scripts for Docker and the supervisor scripts for the offline
installer are almost the same; drop the Docker one and share the same code to
deduplicate them.
Closes #9143
Fixes #9194
In order to decouple the service level controller from the systems logic, we introduce an API for subscribing to configuration changes. The timing of the calls was determined with resource creation and destruction in mind. An API subscriber can create
resources that will be available from the very start of the service level's existence; it can also destroy them, since the service level
is guaranteed not to exist anymore at the time of the call to the deletion notification callback.
Testing:
unit tests - all + a newly added one.
dtests - next-gating (dev mode)
Closes #9097
* github.com:scylladb/scylla:
service level controller: Subscriber API unit test
Service Level Controller: Add a listener API for service level config changes
This change adds an API for registering a listener for service_level
configuration changes. It notifies about removal, addition, and change of a
service level.
The hidden assumption is that some listeners are going to create and/or
manage service-level-specific resources, and this is what guided the
timing of the calls to the subscriber.
Addition and change of a service level are notified before the actual
change takes place; this guarantees that resource creation can take
place before the service level or the new config starts to be used.
The deletion notification is called only after the deletion took place,
which guarantees that the service level can't be active and the
resources created can be safely destroyed.
Refs #9053
Flips the default for commitlog disk footprint hard limit enforcement to off, due
to observed latency stalls in stress runs. Instead, adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to enforce it.
A sort of string-and-tape fix until we can properly tweak the balance between
commitlog and sstable flush rates.
Closes #9195
We decided to enable repair based node operations by default for replace
node operations.
To do that, a new option --allowed-repair-based-node-ops is added. It
lists the node operations that are allowed to enable repair based node
operations.
The operations can be bootstrap, replace, removenode, decommission and rebuild.
By default, --allowed-repair-based-node-ops is set to contain "replace".
Note, the existing option --enable-repair-based-node-ops is still in
play. It is the global switch to enable or disable the feature.
Examples:
- To enable bootstrap and replace node ops:
```
scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap
```
- To disable any repair based node ops:
```
scylla --enable-repair-based-node-ops false
```
Closes #9197
It turns out that user-defined aggregates did not need any elaborate coding in order to be exposed to the users. The whole infrastructure is already there, including system schema tables and support for running aggregate queries, so this series simply adds lots and lots of boilerplate glue code to make UDA usable.
It also comes with a simple test which shows that it's possible to define and use such an aggregate.
Performance was not tested, since user-defined functions are still experimental, so nothing really changes in this matter.
Tests: unit(release)
Fixes #7201
Closes #9165
* github.com:scylladb/scylla:
cql-pytest: add a test suite for user-defined aggregates
cql-pytest: add context managers for functions and aggregates
cql3: enable user-defined aggregates in CQL grammar
cql3: add statements for user-defined aggregates
cql3,functions: add checking if a function is used in UDA
gms: add UDA feature
migration_manager: add migrating user-defined aggregates
db,schema_tables: add handling user-defined aggregates
pagers: make a lambda mutable in fetch_page
cql3: wrap handling paging result with with_thread_if_needed
cql3: correctly mark function selectors as needing threads
cql3: add user-define aggregate representation
The test suite now consists of a single user-defined aggregate:
a custom implementation of the existing avg() built-in function,
as well as a couple of cases for catching incorrect operations,
like using wrong function signatures or dropping used functions.
Aggregates are propagated, created and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates are added based on the UDF implementation.
Function call selectors correctly checked if their arguments
are required to run in threaded context, but forgot to check
the function itself - which is now done.
A user-defined aggregate is represented as an aggregate
which calls its state function on each input row
and then finalizes its execution by calling its final function
on the final state, after all rows were already processed.
The code for handling background view updates used to propagate
exceptions unconditionally, which led to "exceptional future
ignored" warnings if the update was moved to the background.
From now on, the exception is only propagated if its future
is actually waited on.
Fixes#6187
Tested manually, the warning was not observed after the patch
Closes #9179
If the wasmtime library is available for download, it will be
set up by install-dependencies and prepared for linking.
Closes #9151
[avi: regenerate toolchain, which also updates clang to 12.0.1]
I initially tried to use a noncopyable_function to avoid the unnecessary
template usage.
However, since database::apply_in_memory is a hot function, it is better
to use with_gate directly. The run_async function does nothing but call
with_gate anyway.
Closes #9160
On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even though CPU scaling is supported.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
available on both environments, so we should switch to it.
Fixes #9191
Closes #9193
What should the following pair of statements do?
CREATE INDEX xyz ON tbl(a)
CREATE INDEX IF NOT EXISTS xyz ON tbl(b)
There are two reasonable choices:
1. An index with the name xyz already exists, so the second command should
do nothing, because of the "IF NOT EXISTS".
2. The index on tbl(b) does *not* yet exist, so the command should try to
create it. And when it can't (because the name xyz is already taken),
it should produce an error message.
Currently, Cassandra went with choice 1, and Scylla went with choice 2.
After some discussions on the mailing list, we agreed that Scylla's
choice is the better one and Cassandra's choice could be considered a
bug: the "IF NOT EXISTS" feature is meant to allow idempotent creation of
an index - not to make it easy to make mistakes without noticing.
The second command listed above is most likely a mistake by the user,
not anything intentional: the command intended to ensure that an index
on column b exists, but after the silent success of the command, no such
index exists.
So this patch doesn't change any Scylla code (it just adds a comment),
and rather it adds a test which "enshrines" the current behavior.
The test passes on Scylla and fails on Cassandra so we tag it
"cassandra_bug", meaning that we consider this difference to be
intentional and we consider Cassandra's behavior in this case to be wrong.
Fixes #9182.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210811113906.2105644-1-nyh@scylladb.com>
The extra status print is not needed in the log.
Fixes the following error:
ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service -
service/storage_service.cc:3150 @do_receive: failed to log message:
fmt='send_meta_data: got error code={}, from node={}, status={}':
fmt::v7::format_error (argument not found)
Fixes #9183
Closes #9189
The reader is used by load and stream to read sstables from the upload
directory, which are not guaranteed to belong to the local shard.
Use make_range_sstable_reader instead of
make_local_shard_sstable_reader.
Tests:
backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test
Fixes #9173
Closes #9185
Those methods are never used, so it's better not to keep dead code
around.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #9188
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it were running, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649
Closes #8801
* github.com:scylladb/scylla:
hints: fix indentation after previous patch
hints: error injection for pausing hint replay
hints: coroutinize lambda inside send_one_file
Each hint sender runs an asynchronous loop which tries to flush and then
send hints. Between attempts, it sleeps for up to 10 seconds using
sleep_abortable. However, the overload of sleep_abortable which does not
take an abort_source is used - that overload aborts the sleep when
Seastar handles a SIGINT or SIGTERM signal. For that to work, the
application must not prevent default handling of those signals in
Seastar - but Scylla explicitly does, by disabling the
`auto_handle_sigint_sigterm` option in the reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by up to 10 seconds.
To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.
Fixes: #9176
Closes #9177
Since key_compare does not conform to SimpleLessCompare, the benchmark
tests the non-optimized version of bptree (without SIMD key search).
We want to test the optimized version.
Closes #9180
The DynamoDB documentation
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
describes several hard limits on the size of expressions
(ProjectionExpression, ConditionExpression, UpdateExpression,
FilterExpression) and various elements they contain.
In this patch we begin testing those limits with a comprehensive test for
the *length* of each of these four expressions: we test that lengths up to
(and including) 4096 bytes are allowed but longer expressions are rejected.
We also add TODOs for additional documented limits that should be tested
in the future.
Currently, this test passes on DynamoDB but xfails on Alternator because
Alternator does *not* enforce any limits on the expression length. I don't
think this is a real problem, and we may consider keeping it this way,
but we should at least be aware that this difference exists and an
xfailing test will remind us.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210810081948.2012120-2-nyh@scylladb.com>
DynamoDB limits attribute names in items to lengths of up to 65535 bytes,
but in some cases (such as key attributes) the limit is lower - 255.
This patch adds tests for many of these cases.
All the new tests pass on DynamoDB, but some still xfail on Alternator
because Alternator is too lenient - sometimes allowing longer attribute
names than DynamoDB allows. While this may sound great, it also has
downsides: The oversized attribute names perform badly, and as they
grow, Alternator's internal limits will be reached as well, and result
in an unsightly "internal server error" being reported instead of the
expected user-friendly error.
Refs #9169.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210810081948.2012120-1-nyh@scylladb.com>
"
* Make `sstable::make_reader()` return `flat_mutation_reader_v2`,
retain the old one as `sstable::make_reader_v1()`
* Start weaning tests off `sstable::make_reader_v1()` (done all the
easy ones, i.e. those not involving range tombstones)
"
* tag 'sstable-make-reader-v2-v1' of github.com:cmm/scylla:
tests: use flat_mutation_reader_v2 in the easier part of sstable_3_x_test
tests: upgrade the "buffer_overflow" test to flat_mutation_reader_v2
tests: get rid of sstable::make_reader_v1() in broken_sstable_test
tests: get rid of sstable::make_reader_v1() in the trivial cases
sstables: make sstable::make_reader() return flat_mutation_reader_v2
"
Validation compaction -- although I still maintain that it is a good
descriptive name -- was an unfortunate choice for the underlying
functionality because Origin has burned the name already as it uses it
for a compaction type used during repair. This opens the door for
confusion for users coming from Cassandra who will associate Validation
compaction with the purpose it is used for in Origin.
Additionally, since Origin's validation compaction was not user
initiated, it didn't have a corresponding `nodetool` command to start
it. Adding such a command would create an operational difference between
us and Origin.
To avoid all this we fold validation compaction into scrub compaction,
under a new "validation" mode. I decided against the also-suggested
`--dry-mode` flag as I feel that a new mode is a more natural choice:
we don't have to define how it interacts with all the other modes,
unlike with a `--dry-mode` flag.
Fixes: #7736
Tests: unit(dev), manual(REST API)
"
* 'scrub-validation-mode/v2' of https://github.com/denesb/scylla:
compaction/compaction_descriptor: add comment to Validation compaction type
compaction/compaction_descriptor: compaction_options: remove validate
api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
compaction/compaction_manager: hide perform_sstable_validation()
compaction: validation compaction -> scrub compaction (validate mode)
compaction/compaction_descriptor: compaction_options: add options() accessor
compaction/compaction_descriptor: compaction_options::scrub::mode: add validate
Rename the old version to `sstables::make_reader_v1()`, to have a
nicely searchable eradication target.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay behaves as if it were running, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649
table::run_compaction is a trivial wrapper for
table::compact_sstables.
We have lots of similar {start, trigger, run}_compaction functions.
Dropping the run_compaction wrapper to reduce confusion.
Closes #9161
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
* kbr/candidate-outside-config-v3:
raft: sometimes become a candidate even if outside the configuration
raft: fsm: update _commit_idx when applying snapshot
This PR introduces a new feature to hinted handoff: ability to wait until hints from given node are replayed towards a chosen set of nodes.
It replaces the old mechanism which waits for hints to be replayed before repair and exposes it through an HTTP API. The implementation is completely different, so this PR begins with a revert of the old functionality and then introduces the new implementation.
Waiting for hints is made possible with the help of "hint sync points". A sync point is a collection of positions in some hint queues from one node - those positions are encoded into the sync point's description as a hexadecimal string. The sync point consists only of its hexadecimal description - there is no state kept on any of the nodes.
Two operations are available through the HTTP API:
- `/hints_manager/waiting_point` (POST) - _Create a sync point_. Given a set of `target_hosts`, creates a sync point which encodes positions currently at the end of all queues pointing to any of the `target_hosts`.
- `/hints_manager/waiting_point` (GET) - _Wait or check the sync point_. Given a description of a sync point, checks if the sync point was already reached. If you provide a non-zero `timeout` parameter and the sync point is not reached yet, this endpoint will wait until the point is reached or the timeout expires.
Hinted handoff uses the commitlog framework in order to store and replay hints. Each entry (here, a serialized hint) can be identified by a "replay position", which contains the ID of the segment containing the hint, and its position in the file. Replay positions are ordered with respect to segment ID and then position in the file; because segment IDs are assigned sequentially and entries are also written sequentially, this order corresponds to the chronological order in which hints were written. This order also corresponds to the order in which hints are replayed, provided that hint segments are processed starting with the one with the smallest ID first.
The main idea is to track the positions of both the most recently written hint and the most recently replayed hint. When creating a hint sync point, the position of the last written hint is encoded; when the sync point is waited on, the hints manager waits until the last replayed position reaches the position encoded in the sync point. The description of the sync point encodes positions on a per-hint-queue basis - separately for each shard, destination endpoint and hint type (regular or MV).
Note: although the hints manager destroys and re-creates commitlog instances, the ordering described above still works - the ID of the first segment assigned by a commitlog instance corresponds to the number of milliseconds since the epoch, so segments created by newer commitlog instances will have larger IDs.
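The ordering described above can be sketched as follows. This is an illustrative Python model, not Scylla's actual C++ `replay_position` type; the field names are assumptions, but tuple comparison gives exactly the lexicographic (segment ID, offset) order the text describes:

```python
from collections import namedtuple

# A replay position orders first by segment ID, then by offset within
# the segment - matching the chronological order hints were written.
ReplayPosition = namedtuple("ReplayPosition", ["segment_id", "offset"])

def sync_point_reached(last_replayed, sync_point):
    # A queue's part of a sync point is reached once replay has caught
    # up with the position recorded when the sync point was created.
    return last_replayed >= sync_point
```

For example, a position in a later segment always orders after any position in an earlier segment, regardless of offset.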
Before the hints manager is enabled, it performs segment _rebalancing_: for a given endpoint, it makes sure that each shard gets roughly the same number of hint segments. For example, if there are 3 shards and shard 1 has 7 segments, then shard 0 will get 2 segments, shard 1: 3 segments, and shard 2: 2 segments. Apart from distributing the work evenly between shards on startup, it also handles the case when the node is resharded - if the number of shards is reduced, segments from removed shards will be redistributed to lower shards.
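The even split in the rebalancing example can be sketched like this. A hedged illustration in Python (the real code is C++): which shards receive the leftover segments is an arbitrary choice in this sketch (lowest shards first), and the actual implementation may assign them differently:

```python
def rebalanced_counts(total_segments, shard_count):
    # Split `total_segments` across shards so counts differ by at most
    # one; the remainder goes to the lowest-numbered shards here.
    base, extra = divmod(total_segments, shard_count)
    return [base + (1 if shard < extra else 0)
            for shard in range(shard_count)]
```

With 7 segments and 3 shards, every shard gets either 2 or 3 segments and the total is preserved, matching the distribution described above up to which shard keeps the extra segment.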
The possibility of segments being moved between shards on restart makes accurate tracking of hint replay harder. In order to simplify the problem, this PR changes the order in which hint segments are replayed - segments from other shards (here called "foreign" segments) are replayed first, before any "local" segment from this shard. Foreign segments are treated as if they were placed before the 0 replay position - when waiting for a hint sync point, we will __always__ wait for foreign segments to be replayed.
This behavior makes sure that hints generated before the sync point was created will be replayed - and, if segment rebalancing happened in the meantime, we will potentially replay some more segments which were moved across shards.
This PR starts with a revert of the "hints: delay repair until hints are replayed" (#8452) functionality. Some infrastructure introduced in the original PR started to be used by other places in the code, so this is not a simple revert of the merge commit - instead, commits of the old PR are reverted separately and modified in order to make the code compile.
The following commits from the original PR were omitted from the revert because the code introduced by them became used by other logic in repair:
- 0db45d1df5 (repair: introduce abort_source for repair abort)
- 3a2d09b644 (repair: introduce abort_source for shutdown)
- 49f4a2f968 (repair: plug in waiting for hints to be sent before repair)
Refs: #8102
Fixes: #8727
Tests: unit(dev)
Closes #8982
* github.com:scylladb/scylla:
api: add HTTP API for hint sync points
api: register hints HTTP API outside set_server_done
storage_proxy: add functions for creating and waiting for hint sync pts
hints: add functions for creating and waiting for sync points
hints: add hint sync point structure
utils,alternator: move base64 code from alternator to utils
hints: make it possible to wait until hints are replayed
hints: track the RP of the last replayed position
hints: track the RP of the last written hint
hints: change last_attempted_rp to last_succeeded_rp
hints: rearrange error handling logic for hint sending
hints: sort segments by ID, divide into foreign and local
Revert "db/hints: allow to forcefully update segment list on flush"
Revert "db/hints: add a metric for counting processed files"
Revert "db/hints: make it possible to wait until current hints are sent"
Revert "storage_proxy: add functions for syncing with hints queue"
Revert "messaging_service: add verbs for hint sync points"
Revert "storage_proxy: implement verbs for hint sync points"
Revert "config: add wait_for_hint_replay_before_repair option"
Revert "storage_proxy: coordinate waiting for hints to be sent"
Revert "repair: plug in waiting for hints to be sent before repair"
Revert "hints: dismiss segment waiters when hint queue can't send"
Revert "storage_proxy: stop waiting for hints replay when node goes down"
Revert "storage_proxy: add abort_source to wait_for_hints_to_be_replayed"
Adds HTTP endpoints for manipulating hint sync points:
- /hinted_handoff/sync_point (POST) - creates a new sync point for
hints towards nodes listed in the `target_hosts` parameter
- /hinted_handoff/sync_point (GET) - checks the status of the sync
point. If a non-zero `timeout` parameter is given, it waits until the
sync point is reached or the timeout expires.
Registration of the currently unused hinted handoff endpoints is moved
out from the set_server_done function. They are now explicitly
registered in main.cc by calling api::set_hinted_handoff and also
uninitialized by calling api::unset_hinted_handoff.
Setting/unsetting the HTTP API separately will allow passing a reference
to the sync_point_service without polluting the set_server_done function.
Adds a sync_point structure. A sync point is a (possibly incomplete)
mapping from hint queues to replay positions within them. Users will be able
to create sync points consisting of the last written positions of some
hint queues, so then they can wait until hint replay in all of the
queues reach that point.
The sync point supports serialization - first it is serialized with the
help of IDL to a binary form, and then converted to a hexadecimal
string. Deserialization is also possible.
The base64 encoding/decoding functions will be used for serialization of
hint sync point descriptions. Base64 format is not specific to
Alternator, so it can be moved to utils.
Adds necessary infrastructure which allows, for a given endpoint
manager, to wait until hints are replayed up to a specified position. An
abort source must be specified which, if triggered, cancels waiting for
hint replay.
If the endpoint manager is stopped, current waiters are dismissed with
an exception.
Keeps track of a position which serves as an upper bound for positions
of already replayed hints - i.e. all hints with replay positions
strictly lower than it are considered replayed.
In order to accurately track this bound during hint replay, a std::map
is introduced which contains positions of hints which are currently
being sent.
The position of the last written hint is now tracked by the endpoint
hints manager.
When the manager is constructed and no hints are replayed yet, the last
written hint position is initialized to the beginning of a fake segment
with ID corresponding to the current number of milliseconds since the
epoch. This choice makes sure that, in case a new hint sync point is
created before any hints are written, the position recorded for that
hint queue will be larger than all replay positions in segments
currently stored on disk.
Instead of tracking the last position for which hint sending is
attempted, the last successfully replayed position is tracked.
The previous variable was used to calculate the position from which hint
replay should restart in case of an error, in the following way:
_last_not_complete_rp = ctx_ptr->first_failed_rp.value_or(
ctx_ptr->last_attempted_rp.value_or(_last_not_complete_rp));
Now, this formula uses the last_succeeded_rp in place of
last_attempted_rp. This change does not have an effect on the choice of
the starting position of the next retry:
- If the hint at `last_attempted_rp` has succeeded, in the new algorithm
the same position will be recorded in `last_succeeded_rp`, and the
formula will yield the same result.
- If the hint at `last_attempted_rp` has failed, it will be accounted
into `first_failed_rp`, so the formula will yield the same result.
The motivation for this change is that in the next commits of this PR we
will start tracking the position of the last replayed hint per hint
queue, and the meaning of the new variable makes it more useful - when
there are no failed hints in the hint sending attempt, last_succeeded_rp
gives us information that hints _up to this position_ were replayed; the
last_attempted_rp variable can only tell us that hints _before that
position_ were replayed successfully.
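The retry-start formula quoted above can be modeled in a few lines. A hedged Python sketch of the `value_or` chain (replay positions as plain ints, `None` standing in for an empty optional):

```python
def next_retry_start(first_failed_rp, last_succeeded_rp, current):
    # Mirrors: first_failed_rp.value_or(
    #              last_succeeded_rp.value_or(current))
    if first_failed_rp is not None:
        return first_failed_rp
    if last_succeeded_rp is not None:
        return last_succeeded_rp
    return current
```

As the commit argues, when the last attempted hint succeeded the two variables hold the same position, so substituting last_succeeded_rp for last_attempted_rp leaves the result unchanged.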
Instead of calling the `on_hint_send_failure` method inside the hint
sending task in places where an error occurs, we now let the exceptions
be returned and handle them inside a single `then_wrapped` attached to
the hint sending task.
Apart from the `then_wrapped`, there is one more place which calls
`on_hint_send_failure` - in the exception handler for the future which
spawns the asynchronous hint sending task. It needs to be kept separate
because it is a part of a separate task.
Endpoint hints manager keeps a commitlog instance which is used to write
hints into new segments. This instance is re-created every 10 seconds,
which causes the previous instance to leave its segments on disk.
On the other hand, hints sender keeps a list of segments to replay which
is updated only after it becomes empty. The list is repopulated with
segments returned by the commitlog::get_segments_to_replay() method
which does not specify the order of the segments returned.
As a preparation for the upcoming hint sync points feature, this commit
changes the order in which segments are replayed:
- First, segments written by other shards are replayed. Such segments
may appear in the queue because of segment rebalancing which is done
at startup.
The purpose of replaying "foreign" segments first is that they are
problematic for hint sync points. For each hint queue, a hint sync
point encodes a replay position of the last written hint on the local
shard. Accounting foreign segments precisely would make the
implementation more complicated. To make things simpler, waiting for
sync points will always make sure that all foreign segments are
replayed. This might sometimes cause more hints to be waited on than
necessary if a restart occurs in the meantime.
- Segments written by the local shard are replayed later, in order of
their IDs. This makes sure that local hints are replayed in the order
they were written to segments, and will make it possible to use replay
positions to track progress of hint replay.
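The replay order described above can be sketched as a small sort. Illustrative Python (the real code is C++); representing a segment as an `(origin_shard, segment_id)` pair is an assumption of this sketch, and the relative order among foreign segments is left unspecified here as well:

```python
def replay_order(segments, this_shard):
    # Foreign segments (picked up via rebalancing) are replayed first;
    # local segments follow, ordered by segment ID so local hints are
    # replayed in the order they were written.
    foreign = [s for s in segments if s[0] != this_shard]
    local = sorted((s for s in segments if s[0] == this_shard),
                   key=lambda s: s[1])
    return foreign + local
```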
This reverts commit e48739a6da.
This commit removes the functionality from the endpoint hints manager
which allowed flushing hints immediately and forcefully updating the
list of segments to replay.
The new implementation of waiting for hints will be based on replay
positions returned by the commitlog API and it won't be necessary to
forcefully update the segment list when creating a sync point.
This reverts commit 5a49fe74bb.
This commit removes a metric which tracks how many segments were
replayed during the current runtime. It was necessary for the current
"wait for hints" mechanism, which is being replaced with a different
one - therefore we can remove the metric.
This reverts commit 427bbf6d86.
This commit removes the infrastructure which allows waiting until
current hints are replayed in a given hint queue.
It will be replaced with a different mechanism in later commits.
This reverts commit 244738b0d5.
This commit removes create_hint_queue_sync_point and
check_hint_queue_sync_point functions from storage_proxy, which were
used to wait until local hints are sent out to particular nodes.
Similar methods will be reintroduced later in this PR, with a completely
different implementation.
This reverts commit 82c419870a.
This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK
rpc verbs.
The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
This reverts commit 485036ac33.
This commit removes the handlers for HINT_SYNC_POINT_CREATE and
HINT_SYNC_POINT_CHECK verbs.
The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
This reverts commit 86d831b319.
This commit removes the wait_for_hints_before_repair option. Because a
previous commit in this series removes the logic from repair which
caused it to wait for hints to be replayed, this option is now useless.
We can safely remove this option because it is not present in any
release yet.
This reverts commit 46075af7c4.
This commit removes the logic responsible for waiting for other nodes to
replay their hints. The upcoming HTTP API for waiting for hint replay
will be restricted to waiting for hints on the node handling the
request, so there is no need for coordinating multiple nodes.
This reverts commit 49f4a2f968.
The idea to wait for hints to be replayed before repair is not always a
good one. For example, someone might want to repair a small token range
or just one table - but hinted handoff cannot selectively replay hints
like this.
The fact that we are waiting for hints before repair caused a small
number of regressions (#8612, #8831).
This commit removes the logic in repair which caused it to wait for
hints. Additionally, the `storage_proxy.hh` include, which was
introduced in the commit being reverted is also removed and smaller
header files are included instead (gossiper.hh and fb_utilities.hh).
This reverts commit 9d68824327.
First, we are reverting existing infrastructure for waiting for hints in
order to replace it with a different one, therefore this commit needs to
be reverted as well.
Second, errors during hint replay can occur naturally and don't
necessarily indicate that no progress can be made - for example, the
target node is heavily loaded and some hints time out. The "waiting for
hints" operation becomes a user-issued command, so it's not as vital to
ensure liveness.
This reverts commit 22e06ace2c.
The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so we are
removing all infrastructure related to coordinating hint waiting -
therefore this commit needs to be reverted.
This reverts commit 958a13577c.
The `wait_for_hints_to_be_replayed` function is going to be completely
removed in this PR, so this commit needs to be reverted, too.
Even though we switched to an Ubuntu-based container image, housekeeping
is still using the yum repository.
It should be switched to the apt repository.
Fixes #9144
Closes #9147
* seastar ce3cc2687f...07758294ef (12):
> perftune.py: change hwloc-calc parameters order
Fixes perftune on Fedora 34 based hwloc
> resource: pass configuration to nr_processing_units()
> semaphore: semaphore_timed_out: derive from timed_out_error
> Merge "resource: use hwloc_topology_holder" from Benny
> Merge "file: ioctl, fcntl and lifetime_hint interfaces in seastar::file" from Arun George
> pipe: mark pipe_reader and pipe_writer ctors as noexcept
> test: pipe: add simple unit test
> test: source_location_test: relax function name check for gcc 11
> http: add 429 too_many_requests status code
> Added [[nodiscard]] to abort-source's subscribe
> io_queue: Use on_internal_error in io_queue
> reactor: Remove unused epoll poller from reactor
read_schema_partition_for_keyspace() copies some parameters to capture them
in a coroutine, but the same can be achieved more cleanly by changing the
reference parameters to value parameters, so do that.
Test: unit (dev)
Closes #9154
This trial patch set moves compaction_strategy.hh and compaction_garbage_collector.hh to the compaction directory and drops the two unused functions compact_for_mutation_query_state and compact_for_data_query_state.
Closes #9156
* github.com:scylladb/scylla:
compaction: Move compaction_garbage_collector.hh to compaction dir
compaction: Move compaction_strategy.hh to compaction dir
mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state
Prepare for the upcoming strict ALLOW FILTERING check by modifying
unit-test queries that need it. Current code allows such queries both
with and without ALLOW FILTERING; future code will reject them without
ALLOW FILTERING.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The cluster would forget its configuration when taking a snapshot,
making it unable to reelect a leader.
We fix the problem and introduce a regression test.
The last commit introduces some additional assertions for safety.
* kbr/snapshot-preserve-config-v4:
raft: sanity checking of apply index
test: raft: regression test for storing cluster configuration when taking snapshots
raft: store cluster configuration when taking snapshots
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
All entries up to snapshot.idx must obviously be committed, so why not
update _commit_idx to reflect that.
With this we get a useful invariant:
`_log.get_snapshot().idx <= _commit_idx`.
For example, when checking whether the latest active configuration is
committed, it should be enough to compare the configuration index to the
commit index. Without the invariant we would need a special case if the
latest configuration comes from a snapshot.
Before the fix introduced in the previous patch, the cluster would
forget its configuration when taking a snapshot, making it unable to
reelect a leader. This regression test catches that.
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.
This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.
In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
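The lookup performed by `log_last_conf_before` can be sketched simply. A hypothetical Python model (the real code is C++ inside `fsm`): the log is a list of `(index, config)` pairs where `config` is `None` for non-configuration entries, and treating the given index as inclusive is an assumption of this sketch:

```python
def log_last_conf_before(log, idx):
    # Scan entries in index order and remember the configuration of the
    # last configuration entry at or before `idx`.
    best = None
    for entry_idx, config in log:
        if entry_idx > idx:
            break
        if config is not None:
            best = config
    return best
```

This is the configuration that should be stored in the snapshot covering entries up to `idx`.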
Calculating clustering ranges on a local index has been rewritten to use the new `expression` variant.
This allows us to finally remove the old `bounds_ranges` function.
Closes #9080
* github.com:scylladb/scylla:
cql3: Remove unused functions like bounds_ranges
cql3: Use expressions to calculate the local-index clustering ranges
statement_restrictions_test: tests for extracting column restrictions
expression: add a function to extract restrictions for a column
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
This applies to the case when pages are broken by replicas based on
memory limits (not row or partition limits).
If replicas stop pages in the following places:
replica1 = {
row 1,
<end-of-page>
row 2
}
replica2 = {
row 3
}
The coordinator will reconcile the first page as:
{
row 1,
row 3
}
and row 2 will not be emitted at all in the following pages.
The coordinator should notice that replica1 returned a short read and
ignore everything past row 1 from other replicas, but it doesn't.
There is logic to do this trimming, but it is done in
got_incomplete_information_across_partitions() which is executed only
for the partition for which row limits were exhausted.
Fix by running the logic unconditionally.
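The trimming the coordinator must apply can be sketched as follows. A much-simplified Python illustration (the real reconciliation code is C++ and works on mutation fragments): rows are integer positions, and each short-reading replica reports the last row it returned:

```python
def reconcile(replica_rows, short_read_last_rows):
    # Merge rows from all replicas; if any replica returned a short
    # read, drop every row past that replica's last returned row, since
    # the replica might still hold rows in the trimmed region.
    merged = sorted(set().union(*replica_rows))
    if short_read_last_rows:
        cutoff = min(short_read_last_rows)
        merged = [r for r in merged if r <= cutoff]
    return merged
```

In the example above, replica1 stops short after row 1 while replica2 returns row 3; the first page must contain only row 1, so that rows 2 and 3 can both be emitted on later pages.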
Fixes #9119
Tests:
- unit (dev)
- manual (2 node cluster, manual reproducer)
Message-Id: <20210802231539.156350-1-tgrabiec@scylladb.com>
In issue #9083 a user noted that whereas Cassandra's partition-count
estimation is accurate, Scylla's (rewritten in commit b93cc21) is very
inaccurate. The tests introduced here, which all xfail on Scylla, confirm
this suspicion.
The most important tests are the "simple" tests, involving a workload
which writes N *distinct* partitions and then asks for the estimated
partition count. Cassandra provides accurate estimates, which grow
more accurate with more partitions, so it passes these tests, while
Scylla provides bad estimates and fails them.
Additional tests demonstrate that neither Scylla nor Cassandra
can handle anything beyond the "simple" case of distinct partitions.
Two tests which xfail on both Cassandra and Scylla demonstrate that
if we write the same partitions to multiple sstables - or also delete
partitions - the estimated partition counts will be way off.
Refs #9083
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210726211315.1515856-1-nyh@scylladb.com>
We are folding validation compaction into scrub (at least on the
interface level), so remove the validation entry point accordingly and
have users go through `perform_sstable_scrub()` instead.
Fold validation compaction into scrub compaction (validate mode). Only
on the interface level though: to initiate validation compaction one now
has to use `compaction_options::make_scrub(compaction_options::scrub::mode::validate)`.
The implementation code stays as-is -- separate.
Realized that the overall complexity of partition filtering in
cleanup is O(N * log(M)), where
N is # of tokens
M is # of ranges owned by the node
Assuming N=10,000,000 for a table and M=257, N*log(M) ~= 80,056,245
checks performed during the whole cleanup.
This can be optimized by taking advantage that owned ranges are
both sorted and non-wrapping, so an incremental iterator-oriented
checker is introduced to reduce complexity from O(N * log(M)) to
O(N + M) or just O(N).
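The incremental checker can be sketched like this. An illustrative Python model (the real code is C++, and the class name and `(start, end]` range convention here are assumptions of this sketch): because the owned ranges are sorted and non-overlapping, and cleanup visits tokens in sorted order, a cursor into the range list only ever advances, giving O(N + M) total work instead of a binary search per token:

```python
class IncrementalOwnedRangesChecker:
    def __init__(self, ranges):
        # `ranges`: sorted, non-overlapping, non-wrapping (start, end]
        # token ranges owned by this node.
        self.ranges = ranges
        self.idx = 0

    def belongs_to_current_node(self, token):
        # Tokens must be fed in non-decreasing order.
        while self.idx < len(self.ranges) and self.ranges[self.idx][1] < token:
            self.idx += 1  # this range is fully behind us; never revisit
        return (self.idx < len(self.ranges)
                and self.ranges[self.idx][0] < token)
```

Each range is skipped at most once over the whole cleanup, which is where the O(N + M) bound comes from.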
BEFORE
240MB to 237MB (~98% of original) in 3239ms = 73MB/s. ~950016 total partitions merged to 949943.
719MB to 719MB (~99% of original) in 9649ms = 74MB/s. ~2900608 total partitions merged to 2900576.
1GB to 1GB (~100% of original) in 15231ms = 74MB/s. ~4536960 total partitions merged to 4536852.
1GB to 1GB (~100% of original) in 15244ms = 74MB/s. ~4536960 total partitions merged to 4536840.
1GB to 1GB (~100% of original) in 15263ms = 74MB/s. ~4536832 total partitions merged to 4536783.
1GB to 1GB (~100% of original) in 15216ms = 74MB/s. ~4536832 total partitions merged to 4536812.
AFTER
240MB to 237MB (~98% of original) in 3169ms = 74MB/s. ~950016 total partitions merged to 949943.
719MB to 719MB (~99% of original) in 9444ms = 76MB/s. ~2900608 total partitions merged to 2900576.
1GB to 1GB (~100% of original) in 14882ms = 76MB/s. ~4536960 total partitions merged to 4536852.
1GB to 1GB (~100% of original) in 14918ms = 76MB/s. ~4536960 total partitions merged to 4536840.
1GB to 1GB (~100% of original) in 14919ms = 76MB/s. ~4536832 total partitions merged to 4536783.
1GB to 1GB (~100% of original) in 14894ms = 76MB/s. ~4536832 total partitions merged to 4536812.
Fixes #6807.
test: mode(dev).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802213159.182393-1-raphaelsc@scylladb.com>
Finding clustering ranges has been rewritten to use the new
expression variant.
Old bounds_ranges() and other similar ones are no longer needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Removes the old code used to calculate the local-index clustering range
and replaces it with a new implementation based on the expression variant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Not all calculate_natural_endpoints implementations respect the
can_yield flag - for example, everywhere_replication_strategy.
This patch adds yield at the caller site to fix stalls we saw in
do_get_ranges.
Fixes #8943
Closes #9139
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.
Fixes#9138
Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
In this patch we add another test for a case where ALLOW FILTERING
should not be required (and Cassandra doesn't require it) but Scylla
does.
This problem was introduced by pull request #9122. The pull request
fixed an incorrect query (see issue #9085) involving both an index and
a multi-column restriction on a compound clustering key - and the fix is
using filtering. However, in one specific case involving a full prefix,
it shouldn't require filtering. This test reproduces this case.
The new test passes on Cassandra (and also, theoretically, should pass),
but fails on Scylla - the check_af_optional() call fails because Scylla
made ALLOW FILTERING mandatory for that case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210803092046.1677584-1-nyh@scylladb.com>
We already have tests for Query's ExclusiveStartKey option, but we
only exercised it as a way for paging linearly through all the results.
Now we add a test that confirms that ExclusiveStartKey can be used not
just for paging through all the results - but also for jumping directly to
the middle of a partition after any clustering key (existing or non-
existing clustering key). The new test also for the first time verifies
that ExclusiveStartKey with a specific format works (previous tests just
copied LastEvaluatedKey to ExclusiveStartKey, so any opaque cookie could
have worked).
The test passes on both DynamoDB and Alternator so it did not find a new
bug. But it's useful to have as a regression test, in case in the future
we want to improve paging performance (see #6278) - and need to keep in
mind that ExclusiveStartKey is not just for paging.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210729114703.1609058-1-nyh@scylladb.com>
They were already correctly returned to the caller, but we had a
leftover discarded future that would sometimes end up with a
broken_promise exception. Ignore the exception explicitly.
Message-Id: <20210803122207.78406-1-kbraun@scylladb.com>
This adds support for an Azure snitch. The work is an adaptation of
AzureSnitch for Apache Cassandra by Yoshua Wakeham:
https://raw.githubusercontent.com/yoshw/cassandra/9387-trunk/src/java/org/apache/cassandra/locator/AzureSnitch.java
Also change `production_snitch_base` to protect against
a snitch implementation setting DC and rack to an empty string,
which Lubos says can happen on Azure.
Fixes #8593
Closes #9084
* github.com:scylladb/scylla:
scylla_util: Use AzureSnitch on Azure
production_snitch_base: Fallback for empty DC or rack strings
azure_snitch: Azure snitch support
Query pager was reusing query_uuid only when it had no local state (no
_last_pkey), so querier cache was not used when paging locally.
This bug affects performance of aggregate queries like count(*).
Fixes#9127
Message-Id: <20210803003941.175099-1-tgrabiec@scylladb.com>
Following conversion to coroutines in fc91e90c59, remove extra
indents and braces left to make the change clearer.
One variable had to be renamed since without the braces it
duplicated another variable in the same block.
Test: unit (dev)
Closes#9125
With commit 1924e8d2b6, compaction code was moved into a
top level dir as compaction is layered on top of sstables.
Let's continue this work by moving all compaction unit tests
into its own test file. This also makes things much more
organized.
sstable_datafile_test, as its name implies, will only contain
sstable data tests. Perhaps it should be renamed to simply
sstable_data_test, as the file also contains tests involving
other components, not only the data one.
BEFORE
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
105
AFTER
$ cat test/boost/sstable_compaction_test.cc | grep TEST_CASE | wc -l
57
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
48
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>
"
Now reshape can be aborted on either boot or refresh.
The workflow is:
1) reshape starts
2) user notices it's taking too long
3) nodetool stop RESHAPE
the good thing is that completed reshape work isn't lost, allowing
table to enjoy the benefits of all reshaping done up to the abortion
point.
Fixes#7738.
"
* 'abort_reshape_v1' of https://github.com/raphaelsc/scylla:
compaction: Allow reshape to be aborted
api: make compaction manager api available earlier
Now reshape can be aborted on either boot or refresh.
The workflow is:
1) reshape starts
2) user notices it's taking too long
3) nodetool stop RESHAPE
the good thing is that completed reshape work isn't lost, allowing
table to enjoy the benefits of all reshaping done up to the abortion
point.
Fixes#7738.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When profiling cleanup with perf, the merging reader showed up as
significant. Given that cleanup is performed on a single sstable at a
time, the merging reader becomes an extra layer doing useless work.
1.71% 1.71% scylla scylla [.] merging_reader<mutation_reader_merger>::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}::operator()
The mutation compactor, which gets rid of purgeable expired data
and so on, still consumes the data retrieved by the sstable
reader, so no semantic change is made.
With the overhead removed, cleanup becomes ~9% faster, see:
BEFORE
real 1m15.240s
user 0m2.648s
sys 0m0.128s
240MB to 237MB (~98% of original) in 3301ms = 71MB/s.
719MB to 719MB (~99% of original) in 9761ms = 73MB/s.
1GB to 1GB (~100% of original) in 15372ms = 73MB/s.
1GB to 1GB (~100% of original) in 15343ms = 74MB/s.
1GB to 1GB (~100% of original) in 15329ms = 74MB/s.
1GB to 1GB (~100% of original) in 15360ms = 73MB/s.
AFTER
real 1m9.154s
user 0m2.428s
sys 0m0.123s
240MB to 237MB (~98% of original) in 3010ms = 78MB/s.
719MB to 719MB (~99% of original) in 8997ms = 79MB/s.
1GB to 1GB (~100% of original) in 14114ms = 80MB/s.
1GB to 1GB (~100% of original) in 14145ms = 80MB/s.
1GB to 1GB (~100% of original) in 14106ms = 80MB/s.
1GB to 1GB (~100% of original) in 14053ms = 80MB/s.
With a 1TB data set, ~20m would have been saved instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210730190713.462135-1-raphaelsc@scylladb.com>
Add a function, which given an expression and a column,
extracts all restrictions involving this column.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We should not use the current term; we should use the term of the
snapshot's index, which may be lower.
* https://github.com/kbr-/scylla/tree/snapshot-right-term-fix:
test: raft: regression test for using the correct term when taking a snapshot
test: raft: randomized_nemesis_test: server configuration parameter
raft: use the correct term when storing a snapshot
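The essence of the fix can be sketched with simplified, hypothetical structures (Scylla's actual Raft code differs):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// When taking a snapshot up to snapshot_idx, the snapshot must record
// the term of the entry at snapshot_idx, not the server's current
// term, which may already be higher.
struct log_entry { uint64_t term; };

struct snapshot_descriptor {
    uint64_t idx;
    uint64_t term;
};

snapshot_descriptor make_snapshot(const std::vector<log_entry>& log,
                                  uint64_t snapshot_idx,
                                  uint64_t /*current_term: unused on purpose*/) {
    // Use the term of the snapshotted entry (1-based index here).
    return {snapshot_idx, log[snapshot_idx - 1].term};
}
```

Using the current term instead would make the snapshot claim a term its last included entry never had, confusing later log-consistency checks.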
Message changed according to what 'scylla_bootparam_setup' currently does
(set a clock source at boot time) instead of what it used to do in
the past (setting huge pages).
Closes#9116.
When a WHERE clause contains a multi-column restriction and an indexed
regular column, we must filter the results. It is generally not
possible to craft the index-table query so it fetches only the
matching rows, because that table's clustering key doesn't match up
with the column tuple.
Fixes#9085.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#9122
schema_tables is quite hairy, but can be easily simplified with coroutines.
In addition to switching future-returning functions to coroutines, we also
switch Seastar threads to coroutines. This is less of a clear-cut win; the
motivation is to reduce the chances of someone calling a function that
expects to run in a thread from a non-thread context. This sometimes works
by accident, but when it doesn't, it's pretty bad. So a uniform calling convention
has some benefit.
I left the extra indents in, since the indent-fixing patch is hard to rebase in case
a rebase is needed. I will follow up with an indent fix post merge.
Test: unit (dev, debug, release)
Closes#9118
* github.com:scylladb/scylla:
db: schema_tables: drop now redundant #includes
db: schema_tables: coroutinize drop_column_mapping()
db: schema_tables: coroutinize column_mapping_exists()
db: schema_tables: coroutinize get_column_mapping()
db: schema_tables: coroutinize read_table_mutations()
db: schema_tables: coroutinize create_views_from_schema_partition()
db: schema_tables: coroutinize create_views_from_table_row()
db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition()
db: schema_tables: coroutinize create_tables_from_tables_partition()
db: schema_tables: coroutinize create_table_from_name()
db: schema_tables: coroutinize read_table_mutations()
db: schema_tables: coroutinize merge_keyspaces()
db: schema_tables: coroutinize do_merge_schema()
db: schema_tables: futurize and coroutinize merge_functions()
db: schema_tables: futurize and coroutinize user_types_to_drop::drop
db: schema_tables: futurize and coroutinize merge_types()
db: schema_tables: futurize and coroutinize merge_tables_and_views()
db: schema_tables: coroutinize store_column_mapping()
db: schema_tables: futurize and coroutinize read_tables_for_keyspaces()
db: schema_tables: coroutinize read_table_names_of_keyspace()
db: schema_tables: coroutinize recalculate_schema_version()
db: schema_tables: coroutinize merge_schema()
db: schema_tables: introduce and use with_merge_lock()
db: schema_tables: coroutinize update_schema_version_and_announce()
db: schema_tables: coroutinize read_keyspace_mutation()
db: schema_tables: coroutinize read_schema_partition_for_table()
db: schema_tables: coroutinize read_schema_partition_for_keyspace()
db: schema_tables: coroutinize query_partition_mutation()
db: schema_tables: coroutinize read_schema_for_keyspaces()
db: schema_tables: coroutinize convert_schema_to_mutations()
db: schema_tables: coroutinize calculate_schema_digest()
db: schema_tables: coroutinize save_system_schema()
This series improves the repair logging by removing the unused sub_ranges_nr counter, adding peer node ip in the log, removing redundant logs in case of error.
Closes#9120
* github.com:scylladb/scylla:
repair: Remove redundant error log in tracker::run
repair: Do not log errors in repair_ranges
repair: Move more repair single range code into repair_info::repair_range
repair: Use the same uuid from the repair_info
repair: Drop sub_ranges_nr counter
It was observed that since fce124bd90 ('Merge "Introduce
flat_mutation_reader_v2" from Tomasz') database_test takes much longer.
This is expected since it now runs the upgrade/downgrade reader tests
on all existing tests. It was also observed that in a similar time frame
database_test sometimes times out on test machines, taking much
longer than usual, even with the extra work for testing reader
upgrade/downgrade.
In an attempt to reproduce, I noticed it failing on EMFILE (too many
open file descriptors). I saw that tests usually use ~100 open file
descriptors, while the default limit is 1024.
I suspect we have runaway concurrency, but I was not able to pinpoint the
cause. It could be compaction lagging behind, or cleanup work for
deleting tables (the test
test_database_with_data_in_sstables_is_a_mutation_source creates and
deletes many tables).
As a stopgap solution to unblock the tests, this patch raises the file
descriptor limit in the way recommended by [1]. While tests shouldn't
use so many descriptors, I ran out of ideas about how to plug the hole.
Note that main() does something similar, though more elaborate, since
it needs to communicate to users. See ec60f44b64 ("main: improve
process file limit handling").
[1] http://0pointer.net/blog/file-descriptor-limits.html
Closes #9121
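The recommended approach boils down to raising the soft RLIMIT_NOFILE to the hard limit, which an unprivileged process is always allowed to do. A minimal sketch (the helper name is ours, not the test code's):

```cpp
#include <sys/resource.h>

// Raise the soft file-descriptor limit to the hard limit.
// Returns false if either rlimit call fails.
static bool raise_fd_limit_to_hard_max() {
    rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return false;
    }
    rl.rlim_cur = rl.rlim_max;  // soft limit := hard limit
    return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}
```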
We recently saw repairs blocking and not making progress for a prolonged
time (days). One of the primary suspects is reads belonging to several
repairs deadlocking on the streaming read concurrency semaphore. This is
something that we've seen in the past during internal testing, and
although theoretically we have fixed it, such deadlocks are notoriously
hard to reliably reproduce so not seeing them in recent testing doesn't
mean they definitely cannot happen.
The main reason these deadlocks can happen in the first place is that
reads belonging to repairs don't use timeouts. This means that if there
happens to be a deadlock, neither of the participating repairs will give
up and the only way to release the deadlock is restarting the node.
This patch proposes a workaround for these recently seen repair
problems by introducing a timeout for reads belonging to row-level
repair, building on the premise that a failed repair is better than a
stuck repair. A timeout allows one of the participants to give up,
releasing the deadlock, allowing the others to proceed.
The timeout value chosen by this patch is 30m. Note that this applies to
reading a single mutation fragment, not for the entire read. Thirty
minutes should be more than enough for producing a single mutation
fragment.
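A hedged sketch of the per-fragment deadline idea, with hypothetical reader types (not Scylla's actual reader API):

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <stdexcept>

// The 30-minute budget applies to producing each mutation fragment,
// not the whole read, so a long repair still makes progress as long
// as individual fragments keep arriving.
constexpr auto repair_fragment_timeout = std::chrono::minutes(30);

template <typename Reader>
auto next_fragment_or_throw(Reader& reader) {
    auto deadline = std::chrono::steady_clock::now() + repair_fragment_timeout;
    auto fragment = reader.next_fragment(deadline);  // reader honors the deadline
    if (!fragment) {
        // One participant gives up, releasing a potential deadlock
        // and allowing the other repairs to proceed.
        throw std::runtime_error("repair read timed out");
    }
    return *fragment;
}
```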
Refs: #5359
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Closes#9098
Restrict expected exception message to filter only relevant exception,
matching both for scylla and cassandra.
For example, the former has this message:
Cannot use query parameters in CREATE MATERIALIZED VIEW statements
While the latter throws this:
Bind variables are not allowed in CREATE MATERIALIZED VIEW statements
Also, place cleanup code in try-finally clause.
Tests: cql-pytest:test_materialized_view.py(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210802083912.229886-1-pa.solodovnikov@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/entities/TimestampTest.java into our cql-pytest framework.
This test file has a few tests (8) on various features of
cell timestamps. All these tests pass on Cassandra and on Scylla - i.e.,
no new Scylla bug was detected by these tests :-)
Two of the new tests are very slow (6 seconds each) and check a trivial
feature that was already checked elsewhere more efficiently (the fact
that TTL expiration works), so I marked them "skip" after verifying they
really pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210801142738.1633126-1-nyh@scylladb.com>
The caller of tracker::run will log the error. Remove the log inside
tracker::run in case of error to reduce redundancy.
Before:
WARN 2021-07-28 15:24:32,325 [shard 0] repair - repair id [id=1,
uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] failed: std::runtime_error
({shard 0: std::runtime_error (repair id [id=1,
uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 0 failed to repair
513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1,
uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 1 failed to
repair 513 out of 513 ranges)})
WARN 2021-07-28 15:24:32,325 [shard 0] repair - repair_tracker run for
repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] failed:
std::runtime_error ({shard 0: std::runtime_error (repair id [id=1,
uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 0 failed to repair
513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1,
uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 1 failed to
repair 513 out of 513 ranges)})
After:
WARN 2021-07-28 15:33:17,038 [shard 0] repair - repair_tracker run for
repair id [id=1, uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] failed:
std::runtime_error ({shard 0: std::runtime_error (repair id [id=1,
uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] on shard 0 failed to repair
513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1,
uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] on shard 1 failed to
repair 513 out of 513 ranges)})
ERROR 2021-07-28 15:33:17,453 [shard 0] rpc - client 127.0.0.2:7000:
fail to connect: Connection refused
append_reply packets can be reordered and thus reply.commit_idx may be
smaller than the one in the tracker. The tracker's commit index is used
to check if a follower needs to be updated with potentially empty
append message, so the bug may theoretically cause unneeded packets to
be sent.
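One way to guard against this reordering is to only ever move the tracker's commit index forward; a tiny sketch (names hypothetical, not the actual Raft code):

```cpp
#include <algorithm>
#include <cstdint>

// Because append_reply packets may arrive out of order, the tracker
// must ignore a stale, smaller commit index and keep its own value
// monotonically non-decreasing.
struct follower_tracker {
    uint64_t commit_idx = 0;

    void on_append_reply(uint64_t reply_commit_idx) {
        commit_idx = std::max(commit_idx, reply_commit_idx);
    }
};
```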
Message-Id: <YQZZ/6nlNb5nQyXp@scylladb.com>
"
Refs #4186 but does not fix it, because I punted on the "number (and
kinds) of objects migrated and evicted" part.
"
* tag 'gh-4186-reclamation-diagnostics-on-stalls-v6' of github.com:cmm/scylla:
logalloc: add on-stall memory reclaim diagnostics
utils: add a coarse clock
logalloc: split tracker::impl::reclaim into reclaim & reclaim_locked
logalloc: metrics: remove unneeded captures and a pleonasm
logalloc: add metrics for evicted and freed memory
logalloc: count evicted memory
logalloc: count freed memory
Reuse the existing `reclaim_timer` for stall detection.
* Since a timer is now set around every reclaim and compaction, use a
coarse one for speed.
* Set log level according to conditions (stalls deserve a warning).
* Add compaction/migration/eviction/allocation stats.
Refs #4186.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Implement a millisecond-resolution `std::chrono`-style clock using
`CLOCK_MONOTONIC_COARSE`. The use cases are those where you care
about clock sampling latency more than about accuracy.
Assuming non-ancient versions of the kernel & libc, all clock types
recognized by `clock_gettime()` are implemented through a vDSO, so
`clock_gettime()` is not an actual system call. That means that even
`CLOCK_MONOTONIC` (which is what `std::chrono::steady_clock` uses) is
not terribly expensive in practice.
But `CLOCK_MONOTONIC_COARSE` is still 3.5 times faster than that (on
my machine the latencies are 4ns versus 14ns) and is also supposed to
be easier on the cache.
The actual granularity of `CLOCK_MONOTONIC_COARSE` is one tick (on
x86-64, anyway) -- but `clock_getres()` says it has millisecond
resolution, so we use that.
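A self-contained sketch of such a `std::chrono`-style clock, assuming Linux (the name is illustrative; the real implementation lives in utils):

```cpp
#include <chrono>
#include <ctime>

// Millisecond-resolution steady clock backed by the cheap-to-sample
// CLOCK_MONOTONIC_COARSE (Linux-specific). Satisfies the std::chrono
// Clock requirements: duration, rep, period, time_point, is_steady, now().
struct coarse_steady_clock {
    using duration = std::chrono::milliseconds;
    using rep = duration::rep;
    using period = duration::period;
    using time_point = std::chrono::time_point<coarse_steady_clock, duration>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
        // Truncate the raw seconds+nanoseconds reading to milliseconds.
        return time_point(std::chrono::duration_cast<duration>(
            std::chrono::seconds(ts.tv_sec)
            + std::chrono::nanoseconds(ts.tv_nsec)));
    }
};
```

Because it models the standard Clock concept, it can be dropped into timers and `time_point` arithmetic wherever sampling latency matters more than accuracy.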
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The `tables` local is a lw_shared_ptr which is created and then dereferenced
before returning. It can be unpeeled to the pointed-to type, resulting in
one less allocation.
Right now, merge_functions() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.
De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
user_types_to_drop::drop is a function object returning void and
expecting to be called in a thread. Make it return a future and
convert the only value it is initialized with to a coroutine.
De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
Right now, merge_types() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.
The [[nodiscard]] attribute is moved from the function to the
return type, since the function now returns a future which is
nodiscard anyway.
The lambda returned is not coroutinized (yet) since it's part
of the user_types_to_drop inner function that still returns void
and expects to be called in a thread.
De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
Right now, merge_tables_and_views() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.
De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
Right now, read_tables_for_keyspaces() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.
De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
Rather than open-coding merge_lock()/merge_unlock() pairs, introduce
and use a helper. This helps in coroutinization, since coroutines
don't support RAII with destructors that wait.
It is actually `partition_snapshot_flush_accounter`, as opposed to
`partition_snapshot_read_accounter`.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
NOTE: this series depends on a Seastar submodule update, currently queued in next: 0ed35c6af052ab291a69af98b5c13e023470cba3
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
1. `make_exception_future<>` for futures
2. `co_return coroutine::exception(...)` for coroutines
which return `future<T>` (the mechanism does not work for `future<>`
without parameters, unfortunately)
Tests: unit(release)
Closes#9079
* github.com:scylladb/scylla:
system_keyspace: pass exceptions without throwing
sstables: pass exceptions without throwing
storage_proxy: pass exceptions without throwing
multishard_mutation_query: pass exceptions without throwing
client_state: pass exceptions without throwing
flat_mutation_reader: pass exceptions without throwing
table: pass exceptions without throwing
commitlog: pass exceptions without throwing
compaction: pass exceptions without throwing
database: pass exceptions without throwing
"
Previously, the following functions were
incorrectly marked as pure, meaning that the
function is executed at the "prepare" step:
* `currenttimestamp()`
* `currenttime()`
* `currentdate()`
* `currenttimeuuid()`
For functions that possibly depend on timing and random seed,
this is clearly a bug. Cassandra doesn't have a notion of pure
functions, so they are lazily evaluated.
Make Scylla match Cassandra behavior for these functions.
Add a unit test for the fix (excluding the `currentdate()` function,
because there is no way to use a synthetic clock with the query
processor, and sleeping for a whole day to demonstrate correct
behavior is clearly not an option).
Also, extend the cql-pytest for #8604 since there are now more
non-deterministic CQL functions, they are all subject to the test
now.
Fixes: #8816
"
* 'timeuuid_function_pure_annotation_v3' of https://github.com/ManManson/scylla:
test: test_non_deterministic_functions: test more non-pure functions
cql3: change `current*()` CQL functions to be non-pure
Check that all existing non-pure functions (except for
`currentdate()`) work correctly with or without prepared
statements.
Tests: cql-pytest/test_non_deterministic_functions.py(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
These include the following:
* `currenttimestamp()`
* `currenttime()`
* `currentdate()`
* `currenttimeuuid()`
Previously, they were incorrectly marked as pure, meaning
that the function is executed at the "prepare" step.
For functions that possibly depend on timing and random seed,
this is clearly a bug. Cassandra doesn't have a notion of pure
functions, so they are lazily evaluated.
Make Scylla match Cassandra behavior for these functions.
Add a unit test for the fix (excluding the `currentdate()` function,
because there is no way to use a synthetic clock with the query
processor, and sleeping for a whole day to demonstrate correct
behavior is clearly not an option).
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
`create_view_statement::announce_migration()` has an incorrect
check to verify that no bound variables were supplied to a
select statement of a materialized view.
This used `prepare_context::empty()` static method,
which doesn't check the current instance for emptiness but
constructs a new empty instance instead.
The following bit of code actually checked that the pointer
to the new empty instance is not null:
    if (!_variables.empty()) {
        throw exceptions::invalid_request_exception(format("Cannot use query parameters in CREATE MATERIALIZED VIEW statements"));
    }
Use `get_variable_specifications().empty()` instead to fix the semantics of
the `if` statement.
This series also removes this `empty()` method, because it's
not used anymore. The corresponding non-default constructor
is also removed due to being unused.
Tests: unit(dev), cql-pytest:test_materialized_view.py(scylla dev, cassandra trunk)
"
Fixes#9117
* 'create_view_stmt_check_bound_vars_v3' of https://github.com/ManManson/scylla:
test: add a test checking that bind markers within MVs SELECT statement don't lead to a crash
cql3: prepare_context: remove unused methods
cql3: create_view_statement: fix check for bound variables
cql3: make `prepare_context::get_variable_specifications()` return const-ref for lvalue overload
The request should fail with `InvalidRequest` exception and shouldn't
crash the database.
Don't check for actual error messages, because they are different
between Scylla and Cassandra.
The former has this message:
Cannot use query parameters in CREATE MATERIALIZED VIEW statements
While the latter throws this:
Bind variables are not allowed in CREATE MATERIALIZED VIEW statements
Tests: cql-pytest/test_materialized_view.py(scylla dev, cassandra trunk)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The code for checking that an MV's select statement doesn't
have any bind markers uses the wrong method and always returns
`false` even when it should not.
`prepare_context::empty()` is a misleading name because
it doesn't check if the current instance is empty, but creates
an empty instance wrapped in a `lw_shared_ptr` instead.
Thus, the code in `create_view_statement::announce_migration()`
checks that the pointer is not empty, which is always false.
Use `get_variable_specifications().empty()` to check that the
specifications vector inside the `prepare_context`
instance is not empty.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There's no point in copying the `_specs` vector by value in such
case, just return a const reference. All existing uses create
a copy either way.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
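The lvalue/rvalue overload pattern described above can be illustrated with simplified types (`prepare_context_like` and its string specs are hypothetical stand-ins):

```cpp
#include <string>
#include <utility>
#include <vector>

// Ref-qualified overloads: callers holding a live object get a const
// reference with no copy; callers consuming a temporary can still
// move the vector out.
struct prepare_context_like {
    std::vector<std::string> _specs;

    const std::vector<std::string>& get_variable_specifications() const& {
        return _specs;  // no copy for lvalues
    }
    std::vector<std::string> get_variable_specifications() && {
        return std::move(_specs);  // moved out of a temporary
    }
};
```

Callers that do need an independent copy can still make one explicitly from the returned reference.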
"
`function_call` AST nodes are created for each function
with side effects in a CQL query, i.e. non-deterministic
functions (`uuid()`, `now()` and some others timeuuid-related).
These nodes are evaluated either when a query itself is executed
or query restrictions are computed (e.g. partition/clustering
key ranges for LWT requests).
We need to cache the calls since otherwise when handling a
`bounce_to_shard` request for an LWT query, we can possibly
enter an infinite bouncing loop (in case a function is used
to calculate partition key ranges for a query), since the
results can be different each time.
Furthermore, we don't support bouncing more than one time.
Returning `bounce_to_shard` message more than one time
will result in a crash.
Caching works only for LWT statements and only for the function
calls that affect partition key range computation for the query.
`variable_specifications` class is renamed to `prepare_context`
and generalized to record information about each `function_call`
AST node and modify them, as needed:
* Check whether a given function call is a part of partition key
statement restriction.
* Assign ids for caching if above is true and the call is a part
of an LWT statement.
There is no need to include any kind of statement identifier
in the cache key since `query_options` (which holds the cache)
is limited to a single statement, anyway.
Function calls are indexed by the order in which they appear
within a statement while parsing.
Note that `function_call::raw` AST nodes are not created
for selection clauses of a SELECT statement hence they
can accept only one of the following things as parameters:
* Other function calls.
* Literal values.
* Parameter markers.
In other words, only parameters that can be immediately reduced
to a byte buffer are allowed and we don't need to handle
database inputs to non-pure functions separately since they
are not possible in this context. Anyhow, we don't even have
a single non-pure function that accepts arguments, so precautions
are not needed at the moment.
Add a test written in `cql-pytest` framework to verify
that both prepared and unprepared LWT statements handle
`bounce_to_shard` messages correctly in such a scenario.
Fixes: #8604
Tests: unit(dev, debug)
NOTE: the patchset uses `query_options` as a container for
cached values. This doesn't look clean and `service::query_state`
seems to be a better place to store them. But it's not
forwarded to most of the CQL code and would mean that a huge number
of places would have to be amended.
The series presents a trade-off to avoid forwarding `query_state`
everywhere (but maybe it's the thing that needs to be done, nonetheless).
"
* 'lwt_bounce_to_shard_cached_fn_v6' of https://github.com/ManManson/scylla:
cql-pytest: add a test for non-pure CQL functions
cql3: cache function calls evaluation for non-deterministic functions
cql3: rename `variable_specifications` to `prepare_context`
As preparation for converting term::raw an expression, make it
forward declarable so that we can have a term::raw that is an
expression, and an expression that is a term::raw, without driving
the compiler insane.
Closes#9101
hierarchy with expressions' from Avi Kivity
Currently, the grammar has two parallel hierarchies. One hierarchy is
used in the WHERE clause, and is based on a combination of `term`
and expressions. The other is used in the SELECT clause, and is
using the cql3::selection::selectable hierarchy. There is some overlap
between the hierarchies: both can name columns. Logically, however,
they overlap completely - in SQL anything you can select you can
filter on, and vice versa. So merging the two hierarchies is important if
we want to enrich CQL. This series does that, partially (see below),
converting the SELECT clause to expressions.
There is another hierarchy split: between the "raw", pre-prepare object
hierarchy, and post-prepare non-raw. This series limits itself to converting
the raw hierarchy and leaves the non-raw hierarchy alone.
An important design choice is not to have this raw/non-raw split in expressions.
Note that most of the hierarchy is completely parallel: addition is addition
both before prepare and after prepare (but see [1]). The main difference
is around identifiers - before preparation they are unresolved, and after
preparation they become `column_definition` objects. We resolve that by
having two separate types: `unresolved_identifier` for the pre-prepare phase,
and the existing `column_value` for post-prepare phase.
Alternative choices would be to keep a separate expression::raw variant, or
to template the expression variant on whether it is raw or not. I think it would
cause undue bloat and confusion.
Note the series introduces many on_internal_error() calls. This is because
there is not a lot of overlap in the hierarchies today; you can't have a cast in
the WHERE clause, for example. These on_internal_error() calls cannot be
triggered since the grammar does not yet allow such expressions to be
expressed. As we expand the grammar, they will have to be replaced with
working implementations.
Lastly, field selection is expressible in both hierarchies. This series does not yet
merge the two representations (`column_value.sub` vs `field_selection`), but it
should be easy to do so later.
[1] the `+` operator can also be translated to list concatenation, which we may
choose to represent by yet another type.
Test: unit(dev)
Closes#9087
* github.com:scylladb/scylla:
cql3: expression: update find_atom, count_if for function_call, cast, field_selection
cql3: expressions: fix printing of nested expressions
cql3: selection: replace selectable::raw with expression
cql3: expression: convert selectable::with_field_selection::raw to expression
cql3: expression: convert selectable::with_cast::raw to expression
cql3: expression: convert selectable::with_anonymous_function::raw to expression
cql3: expression: convert selectable::with_function_call::raw to expressions
cql3: selectable: make selectable::raw forward-declarable
cql3: expressions: convert writetime_or_ttl::raw to expression
cql3: expression: add convenience constructor from expression element to nested expression
utils: introduce variant_element.hh
cql3: expression: use nested_expression in binary_operator
cql3: expression: introduce nested_expression class
Convert column_identifier_raw's use as selectable to expressions
make column_identifier::raw forward declarable
cql3: introduce selectable::with_expression::raw
Introduce a test using `cql-pytest` framework to assert that
both prepared and unprepared LWT statements (insert with
`IF NOT EXISTS`) with a non-deterministic function call
work correctly in case its evaluation affects partition
key range computation (hence the choice of `cas_shard()`
for lwt query).
Tests: cql-pytest/test_non_deterministic_functions.py
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
And reuse these values when handling `bounce_to_shard` messages.
Otherwise such a function (e.g. `uuid()`) can yield a different
value when a statement is re-executed on the other shard.
This can lead to an infinite number of `bounce_to_shard` messages
being sent in case the function value is used to calculate partition
key ranges for the query. Since we don't support bouncing more
than once, the second hop will result in a crash.
Caching works only for LWT statements and only for the function
calls that affect partition key range computation for the query.
`variable_specifications` class is renamed to `prepare_context`
and generalized to record information about each `function_call`
AST node and modify them, as needed:
* Check whether a given function call is a part of partition key
statement restriction.
* Assign ids for caching if above is true and the call is a part
of an LWT statement.
There is no need to include any kind of statement identifier
in the cache key since `query_options` (which holds the cache)
is limited to a single statement, anyway.
Note that `function_call::raw` AST nodes are not created
for selection clauses of a SELECT statement, hence they
can accept only the following as parameters:
* Other function calls.
* Literal values.
* Parameter markers.
In other words, only parameters that can be immediately reduced
to a byte buffer are allowed and we don't need to handle
database inputs to non-pure functions separately since they
are not possible in this context. Anyhow, we don't even have
a single non-pure function that accepts arguments, so precautions
are not needed at the moment.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Convert all known tri-compares that return an int to return std::strong_ordering.
Returning an int is dangerous since the caller can treat it as a bool, and indeed
this series uncovered a minor bug (#9103).
Test: unit (dev)
Fixes #1449
Closes #9106
* github.com:scylladb/scylla:
treewide: remove redundant "x <=> 0" compares
test: mutation_test: convert internal tri-compare to std::strong_ordering
utils: int_range: change to std::strong_ordering
test: change some internal comparators to std::strong_ordering
utils: big_decimal: change to std::strong_ordering
utils: fragment_range: change to std::strong_ordering
atomic_cell: change compare_atomic_cell_for_merge() to std::strong_ordering
types: drop scaffolding erected around lexicographical_tri_compare
sstables: keys: change to std::strong_ordering internally
bytes: compare_unsigned(): change to std::strong_ordering
uuid: change comparators to std::strong_ordering
types: convert abstract_type::compare and related to std::strong_ordering
types: reduce boilerplate when comparing empty value
serialized_tri_compare: change to std::strong_ordering
compound_compat: change to std::strong_ordering
types: change lexicographical_tri_compare, prefix_equality_tri_compare to std::strong_ordering
Same as 4309785, dpkg does not re-install conffiles when they are removed
by the user, so we are missing sysconfdir.conf for scylla-housekeeping on
rollback.
To prevent this, we need to stop removing the drop-in file directory on
'remove'.
Fixes #9109
Closes #9110
"
There are a few places that call the global storage service, but all
are easily fixable without significant changes.
1. alternator -- needs token metadata, switch to using proxy
2. api -- calls methods from storage service, all handlers are
registered in main and can capture storage service from there
3. thrift -- calls methods from storage service, can carry the
reference via controller
4. view -- needs tokens, switch to using (global) proxy
5. storage_service -- (surprisingly) can use "this"
tests: unit(dev), dtest(simple_boot_shutdown, dev)
"
* 'br-unglobal-storage-service' of https://github.com/xemul/scylla:
storage_service: Make it local
storage_service: Remove (de)?init_storage_service()
storage_service: Use container() in run_with(out)_api_lock
storage_service: Unmark update_topology static
storage_service: Capture this when appropriate
view: Use proxy to get token metadata from
thrift: Use local storage service in handlers
thrift: Carry sharded<storage_service>& down to handler
api: Capture and use sharded<storage_service>& in handlers
api: Carry sharded<storage_service>& down to some handlers
alternator: Take token metadata from server's storage_proxy
alternator: Keep storage_proxy on server
"
This is the 3rd slowest test in the set. There are 3 cases out
there that are hard-coded to be sequential. However, splitting
them into boost test cases helps running this test faster in
--parallel-cases mode. Timings for debug mode:
Total before the patch: 25 min
Sequential after the patch: 25 min
Basic case: 5 min
Evict-paused-readers case: 5 min
Single-mutation-buffer case: 15 min
tests: unit.multishard_combining_reader_as_mutation_source(debug)
"
* 'br-parallel-mcr-test' of https://github.com/xemul/scylla:
test: Split test_multishard_combining_reader_as_mutation_source into 3
test: Fix indentation after previous patch
test: Move out internals of test_multishard_combining_reader_as_mutation_source
Continuing the work from e4eb7df1a1, let's guarantee
serialization of sstable set updates by making all sites acquire
the mutation permit. Then the table no longer relies on the serialization
mechanism of the row cache's update functions.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210728174740.78826-1-raphaelsc@scylladb.com>
The benefit is that inside repair_info::repair_range we have the peer
node information. It is useful to log peer nodes.
In addition, we can avoid logging similar messages twice in case the
range is skipped. For example:
INFO 2021-07-28 14:57:15,388 [shard 1] repair - Repair 417 out of 513
ranges, id=[id=1, uuid=72344b26-1db2-48a0-bc5b-e8ac2874e154], shard=1,
keyspace=keyspace1, table={standard1}, range=(5380136883876426790,
5406788998747631705]
WARN 2021-07-28 14:57:15,388 [shard 1] repair - Repair 417 out of 513
ranges, id=[id=1, uuid=72344b26-1db2-48a0-bc5b-e8ac2874e154], shard=1,
keyspace=keyspace1, table={standard1}, range=(5380136883876426790,
5406788998747631705], peers={127.0.0.2}, live_peers={}, status=skipped
There are 3 places that can now declare a local instance:
- main
- cql_test_env
- boost gossiper test
The global pointer is saved in the debug namespace for debugging.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One of them just re-wraps arguments in std::ref and calls for
global storage service. The other one is dead code which also
calls the global s._s. Remove both and fix the only caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And use container() to reshard to shard 0. This removes one
more call for global storage service instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some storage_service methods call for global storage service instance
while they can enjoy "this" pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The mutate_MV() call needs token metadata and it gets them from
global storage service. Fixing it not to use globals is a huge
refactoring, so for now just get the tokens from global storage
proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The thrift_handler class' methods need storage service. This
patch makes sure this class has sharded storage service
reference on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference in question is already there, handlers that need
storage service can capture it and use. These handlers are not
yet stopped, but neither is the storage service itself, so no
potentially dangling reference is introduced here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both set_server_storage_service and set_server_storage_proxy set up
API handlers that need storage service to work. Now they all call for
global storage service instance, but it's better if they receive one
from main. This patch carries the sharded storage service reference
down to handlers setting function, next patch will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a local_nodelist_handler serving API requests that calls
for global storage service to get token metadata from. Now it
can get storage proxy reference from server upon construction
and use it for tokens.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This series changes the behavior of the system when executing reads
annotated with "bypass cache" clause in CQL. Such reads will not
use nor populate the sstable partition index cache and sstable index page cache.
"
* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
sstables: Do not populate partition index cache for "bypass cache" reads
If x is of type std::strong_ordering, then "x <=> 0" is equivalent to
x. These no-ops were inserted during #1449 fixes, but are now unnecessary.
They have potential for harm, since they can hide an accidental change of
the type of x to an arithmetic type, so remove them.
Ref #1449.
The signature already returned std::strong_ordering, but an
internal comparator returned int. Switch it, so it now uses
the strong_ordering overload of lexicographical_tri_compare().
Ref #1449.
The original signatures with `int` are retained (by calling the new
signatures), until the callers are converted. Constraints are used to
disambiguate.
Ref #1449.
This series removes unused partition level repair related code.
Closes #9105
* github.com:scylladb/scylla:
repair: Drop stream plan related code
locator: Add missing file.hh include in production_snitch_base
repair: Drop request_transfer_ranges and do_streaming
repair: Drop parallelism_semaphore
gcc 10.3.1 spews the following error:
```
_test_generator::generate_scenario(std::mt19937&) const’:
test/boost/mutation_reader_test.cc:3731:28: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
3731 | for (auto i = 0; i < num_ranges; ++i) {
| ~~^~~~~~~~~~~~
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210728073538.2467040-1-bhalevy@scylladb.com>
```
clang++
build/dev/locator/production_snitch_base.o
locator/production_snitch_base.cc
In file included from locator/production_snitch_base.cc:41:
In file included from ./locator/production_snitch_base.hh:41:
In file included from
/usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/unordered_map:38:
/usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/type_traits:1329:23:
error: incomplete type 'seastar::file' used in type trait expression
__bool_constant<__has_trivial_destructor(_Tp)>>
^
```
This code compiles now only due to an indirect include from repair.hh.
Fixes #9103
A compare overload was declared as "bool" even though it is a tri-compare.
This caused us to never use the speed-up shortcut (lessening the search set),
in turn meaning more overhead for collections.
Closes #9104
"
While working on evicting range tombstones one of the nastiest
difficulties is that mutation_partition has very loose control
over adding and removing of rows and range tombstones. This is
because it exposes both collections via public methods, so it's
pretty easy to grab a non-const reference on either of it and
modify the collection.
At the same time restricting the API with returning only const
reference on the collection is not possible either, since finding
in or iterating over a const-referenced collection would expose
the const-reference element as well, while it can be perfectly
valid to modify the single row/tombstone without touching the
whole collection.
In other words there's the need for an access method that both
guarantees that no new elements are added to the collection,
nor existing ones are removed from it, AND doesn't impose const
on the obtained elements.
The solution proposed here is the immutable_collection<> template
that wraps a non-const collection reference and gives the caller
only reading methods (find, lower_bound, begin, etc) so that it's
guaranteed that the external user of mutation_partition won't be
able to modify the collections. Those that already use the const
reference on the mutation_partition itself are OK, they also use
const-referenced everything. The places that do need to modify
the partition's collections are thus made explicit.
tests: unit(dev)
"
* 'br-mutation-partition-collection-view-3' of https://github.com/xemul/scylla:
mutation_partition: Return immutable collection for range tombstones
mutation_partition: Pin mutable access to range tombstones
mutation_partition: Return immutable collection for rows
mutation_partition: Pin mutable access to rows
utils: Introduce immutable_collection<>
btree: Generalize some iterator methods
btree: Make iterators not modify the tree itself
btree tests: Dont use iterator erase
mutation_partition: Shuffle declarations
range_tombstone_list: Mark more methods noexcept
range_tombstone_list, code: Mark external_memory_usage noexcept
The combination of the new types and these functions cannot happen yet,
but as they are generic functions it is better to implement them in
case it becomes possible later.
Now that all selectable::raw subclasses have been converted to
cql3::selectable::with_expression::raw, the class structure is
just a wrapper around expressions. Peel it, converting the
virtual member functions to free functions, and replacing
object instances with expression or nested_expression as the
case allows.
Add a field_selection variant element to expression. Like function_call
and cast, the structure from which a field is selected cannot yet be
an expression, since not all selectable::raw:s are converted. This will
be done in a later pass. This is also why printing a field selection now
does not print the selected expression; this will also be corrected later.
Add a cast variant element to expression. Like function_call, the
argument being converted cannot yet be an expression, since not
all selectable::raw:s are converted. This will be done in a later
pass. This is also why printing a cast now does not print the
casted expression; this will also be corrected later.
Rather than creating a new variant element in expression, we extend
function_call to handle both named and anonymous functions, since
most of the processing is the same.
Add a function_call variant element to hold function calls. Note
that because not all selectables are yet converted, function call
arguments are still of type selectable::raw. They will be converted
to expressions later. This is also why printing a function now
does not print its arguments; this will also be corrected later.
As temporary scaffolding while we're converting selectable::raw
subclasses to expressions, we'll need expressions to refer to
selectable::raw (specifically, function call arguments, which will
end up as expressions as well). To avoid a #include loop, make
selectable::raw forward-declarable by moving it to namespace scope.
Create a new element in the expression variant, column_mutation_attribute,
signifying we're picking up an attribute of a column mutation (not a
column value!). We use an enum rather than a bool to choose between
writetime and ttl (the two mutation attributes) for increased
explicitness.
Although there can only be one type for the column we're operating
on (it must be an unresolved_identifier), we use a nested_expression.
This is because we'll later need to also support a column_value
as the column type after we prepare it. This is somewhat similar
to the address of operator in C, which syntactically takes any
expression but semantically operates only on lvalues.
It is convenient to initialize a nested_expression variable from
one of the types that compose the expression variant, but C++ doesn't
allow it. Add a constructor that does this. Use the new variant_element
concept to constrain the input to be one of the variant's elements.
A type trait (is_variant_element) and a concept (VariantElement)
that tell if a type T is a member of a variant or not. It can be
used even if the variant's elements are not yet defined (just
forward-declared).
The expression type cannot be a member of a struct that is an
element of the expression variant. This is because it would then
be required to contain itself. So introduce a nested_expression
type to indirectly hold an expression, but keep the value semantics
we expect from expressions: it is copyable and a copy has separate
identity and storage.
In fact binary_operator had to resort to this trick, so it's converted
to nested_expression in the next patch.
Introduce unresolved_identifer as an unprepared counterpart to column_value.
column_identifier_raw no longer inherits from selectable::raw, but
keeps its methods for now to reduce churn.
Patch the .row_tombstones() to return the range_tombstone_list
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the tombstones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some callers of mutation_partition::row_tombstones() don't want
(and shouldn't) modify the list itself, while they may want to
modify the tombstones. This patch explicitly locates those that
need to modify the collection, because the next patch will
return immutable collection for the others.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Patch the .clustered_rows() method to return the btree of rows
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the elements in it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some callers of mutation_partition::clustered_rows() don't want
(and shouldn't) modify the tree of rows, while they may want to
modify the rows themselves. This patch explicitly locates those
that need to modify the collection, because the next patch will
return immutable collection for the others.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Working with collections can be done via const and non-const
references. In the former case the collection can only be read
from (find, iterate, etc.); in the latter it's possible to alter
the collection (erase elements from it or insert them into it). Also
the const-ness of the collection reference is transparently
inherited by the returned _elements_ of the collection, so when
holding a const reference on a collection it's impossible to
modify the found element.
This patch introduces an immutable_collection -- a wrapper over
an arbitrary collection that makes sure the collection itself is not
modified, while the elements obtained from it can be non-const.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The non-const iterator has a constructor from a key pointer and
the tree_if_singular method. There's no reason why these
two are absent from the const_iterator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The const_iterator cannot modify anything, but the plain
iterator has a public method to remove the key from the tree.
To control how the tree is modified, this method must be
marked private and modification by iterator should come
from somewhere else.
This somewhere else is the existing key_grabber that's
already used to move keys between trees. Generalize this
ability to move a key out of a tree (i.e. -- erase).
Once done -- mark the iterator::erase_and_dispose private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will mark btree::iterator methods that modify
the tree itself as private, so stop using them in tests.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Its methods that provide access to enclosed collections of rows
and range tombstones are intermixed, so group them for smoother
next patching and mark noexcept while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those returning iterators and size for the underlying
collection of range tombstones are all non-throwing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The range_tombstone_list's method is at the top of the
stack of calls each not throwing anything, so do the
deep-dive noexcept marking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Otherwise we run into a #include loop when trying to have an expression
with column_identifier::raw: expression.hh -> column_identifier.hh
-> selectable.hh -> expression.hh.
Prepare to migrate selectable::raw sub-classes to expressions by
creating a bridge between the two types. with_expression::raw
is a selectable::raw and implements all its methods (right now,
trivially), and its content is an expression. The methods are
implemented using the usual visitor pattern.
Although the switch in `to_string(compaction_options::scrub::mode)`
covers all possible cases, gcc 10.3.1 warns about:
```
sstables/compaction.cc: In function ‘std::string_view sstables::to_string(sstables::compaction_options::scrub::mode)’:
sstables/compaction.cc:95:1: error: control reaches end of non-void function [-Werror=return-type]
```
Adding __builtin_unreachable(), as in `to_string(compaction_type)`
does calm the compiler down, but it might cause undefined behavior
in the future if the switch ever fails to cover all cases, or
if the passed value is somehow corrupt.
Instead, call on_internal_error_noexcept to report the
error and abort if configured to do so; otherwise,
just return an "(invalid)" string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210727130251.2283068-1-bhalevy@scylladb.com>
"
This series fixes two issues which cause very poor efficiency of reads
when there are a lot of range tombstones per live row in a partition.
The first issue is in the row_cache reader. Before the patch, all range
tombstones up to the next row were copied into a vector, and then put
into the buffer until it's full. This would get quadratic if there are
many more range tombstones than fit in a buffer.
The fix is to avoid the accumulation of all tombstones in the vector
and invoke the callback instead, which stops the iteration as soon as
the buffer is full.
Fixes #2581.
The second, similar issue was in the memtable reader.
Tests:
- unit (dev)
- perf_row_cache_update (release)
"
* tag 'no-quadratic-rt-in-reads-v1' of github.com:tgrabiec/scylla:
test: perf_row_cache_update: Uncomment test case for lots of range tombstones
row_cache: Consume range tombstones incrementally
partition_snapshot_reader: Avoid quadratic behavior with lots of range tombstones
tests: mvcc: Relax monotonicity check
range_tombstone_stream: Introduce peek_next()
"
It exists in the node-ops handler which is registered by repair code,
but is handled by storage service. Probably, the whole node-ops handler
should instead be moved into repair, but this looks like a rather huge
rework. So instead -- put the node-ops verb registration inside the
storage-service.
This removes some more calls for global storage service instance and
allows slight optimization of node-ops cross-shards calls.
tests: unit(dev), start-stop
"
* 'br-remove-storage-service-from-nodeops' of https://github.com/xemul/scylla:
storage_service: Replace globals with locals
storage_service: Remove one extra hop of node-ops handler
storage_service: Fix indentation after previous patch
storage_service: Move cross-shard hop up the stack
repair: Drop empty verbs reg/unreg methods
repair, storage_service: Move nodeops reg/unreg to storage service
repair: Coroutinize row-level start/stop
Bring supervisor support from dist/docker to install.sh, make it
installable from relocatable package.
This enables using supervisor in a nonroot / offline environment,
and also makes the relocatable package able to run in a Docker environment.
Related #8849
Closes #8918
This patch follows #9002, further reducing the complexity of the sstable readers.
The split between row consumer interfaces and implementations was first added in 2015, and there is no reason to create new implementations anymore. By merging those classes, we achieve a sizeable reduction in sstable reader length and complexity.
Refs #7952
Tests: unit(dev)
Closes #9073
* github.com:scylladb/scylla:
sstables: merge row_consumer into mp_row_consumer_k_l
sstables: move kl row_consumer
sstables: merge consumer_m into mp_row_consumer_m
sstables: move mp_row_consumer_m
This is a translation of Cassandra's CQL unit test source file
validation/entities/TypeTest.java into our cql-pytest framework.
This is a tiny test file, with only four tests which apparently didn't
find their place in other source files. All four tests pass on Cassandra,
and all but one pass on Scylla - the test marked xfail discovered one
previously-unknown incompatibility with Cassandra:
Refs #9082: DROP TYPE IF EXISTS shouldn't fail on non-existent keyspace
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210726140934.1479443-1-nyh@scylladb.com>
Prevent accidental conversions to bool from yielding the wrong results.
Unprepared users (that converted to bool, or assigned to int) are adjusted.
Ref #1449
Test: unit (dev)
Closes #9088
* seastar 93d053cd...ce3cc268 (4):
> doc: update coroutine exception paragraph with make_exception
> coroutine: add make_exception helper
> coroutine: use std::move for forwarding exception_ptr
> doc: tutorial: document direct exception propagation
With the new throw-less coroutine exception support, we can modify
some of Scylla's new coroutine code to generate exceptions a bit more
efficiently, without actually throwing an exception.
Before the patch, all range tombstones up to the next row were copied
into a vector, and then put into the buffer until it's full. This
would get quadratic if there are many more range tombstones than fit in
a buffer.
The fix is to avoid the accumulation of all tombstones in the vector
and invoke the callback instead, which stops the iteration as soon as
the buffer is full.
Fixes#2581.
next_range_tombstone() was populating _rt_stream on each invocation
from the current iterator ranges in _range_tombstones. If there are a
lot of range tombstones, all of them would be put into _rt_stream. One
problem is that this can cause a reactor stall. Fix by a more incremental
approach where we populate _rt_stream with a minimal amount on each
invocation of next_range_tombstone().
Another problem is that this can get quadratic. The iterators in
_range_tombstones are advanced, but if lsa invalidates them across
calls they can revert back to the front since they go back to
_last_rt, which is the last consumed range tombstone, and if the
buffer fills up, not all tombstones from _rt_stream could be
consumed. The new code doesn't have this problem because everything
which is produced out of the iterators in _range_tombstones is
produced only once. What we put into _rt_stream is consumed first
before we try to feed the _rt_stream with more data.
Consecutive range tombstones can have the same position. They will, in
one of the test cases, after the range tombstone merger in
partition_snapshot_flat_reader no longer uses range_tombstone_list to
merge data from multiple versions, which deoverlaps, but rather merges
the streams corresponding to each version, which interleaves range
tombstones from different versions.
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
1. make_exception_future<> for futures
2. co_return coroutine::exception(...) for coroutines
which return future<T> (the mechanism does not work for future<>
without parameters, unfortunately)
The node-ops verb handler is a lambda of the storage service and it
can stop using the global storage service instance at no extra charge.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now clear that the verb handler goes to some "random"
shard, then immediately switches to shard-0 and then does
the handling. Avoid the extra hop and go to shard-0 right
away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_service::node_ops_cmd_handler runs inside a huge
invoke_on(0, ...) lambda. Make it be called on shard-0. This
is the preparation for next two patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service is the verb sender, so it must be the verb
registrator. Another goal of this patch is to allow removal of
repair -> storage_service dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Partition count is of type size_t but we use std::plus<int>
to reduce partition count values across various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.
Fixes #9090
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #9074
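A minimal sketch of the pitfall (the function name is illustrative): with `std::plus<int>` both operands are converted to `int` before the addition, so counts above `INT32_MAX` wrap, while `std::plus<size_t>` keeps the full 64-bit width:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Sum per-table partition counts. The buggy form used std::plus<int>(),
// which truncates each size_t operand to 32 bits before adding;
// std::plus<size_t>() (or the transparent std::plus<>()) is correct.
size_t total_partitions(const std::vector<size_t>& counts) {
    return std::accumulate(counts.begin(), counts.end(),
                           size_t(0), std::plus<size_t>());
}
```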
This is a translation of Cassandra's CQL unit test source file
validation/entities/TupleTypeTest.java into our cql-pytest framework.
This test file has a few tests on various features of tuples.
Unfortunately, some of the tests could not be easily translated into
Python so were left commented out: Some tests try to send invalid input
to the server, which the Python driver "helpfully" forbids; two tests
used an external testing library "QuickTheories" and are the only two
tests in the Cassandra test suite to use this library - so it's not
worthwhile to translate them to Python.
11 tests remain, all of them pass on Cassandra, and just one fails on
Scylla (so marked xfail for now), reproducing one known issue:
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Actually, += is not supposed to be supported on tuple columns anyway, but
should print the appropriate error - not the syntax error we get now as
the "+=" feature is not supported at all.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210722201900.1442391-1-nyh@scylladb.com>
Since detach_buffer is used before closing and
destroying the reader, we want to mark it as noexcept
to simplify the caller's error handling.
Currently, although it does construct a new circular_buffer,
none of the constructors used may throw.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210617114240.1294501-2-bhalevy@scylladb.com>
detach_buffer exchanges the current _buffer with
a new buffer constructed using the circular_buffer(Alloc)
constructor. The compiler implicitly constructs a
tracking_allocator(reader_permit) and passes it
to the circular_buffer constructor.
This patch just makes that explicit so it would be
clearer to the reader what's going on here.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210617114240.1294501-1-bhalevy@scylladb.com>
Loading snapshot id and term + vote involve selecting static
fields from the "system.raft" table, constrained by a given
group id.
The code incorrectly assumes that, for example,
`SELECT snapshot_id FROM raft WHERE group_id=?` in
`load_snapshot` always returns only one row.
This is not true, since this will return a row
for each (pk, ck) combination, which is (group_id, index)
for "system.raft" table.
The same applies for the `load_term_and_vote`, which selects
static `vote_term` and `vote` from "system.raft".
This results in a crash at node startup when there is
a non-empty raft log containing more than one entry
for a given `group_id`.
Restrict the selection to always return one row by applying
`LIMIT 1` clause.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210723183232.742083-1-pa.solodovnikov@scylladb.com>
The class is repurposed to be more generic and also be able
to hold additional metadata related to function calls within
a CQL statement. Rename all methods appropriately.
Visitor functions in AST nodes (`collect_marker_specification`)
are also renamed to a more generic `fill_prepare_context`.
The name `prepare_context` designates that this metadata
structure is a byproduct of `stmt::raw::prepare()` call and
is needed only for "prepare" step of query execution.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Commit 2150c0f7a2, proposed in issue #5619,
added a limitation that USING TIMESTAMP cannot be more than 3 days into
the future. But the actual code used to check it,
timestamp - now > MAX_DIFFERENCE
only makes sense for *positive* timestamps. For negative timestamps,
which are allowed in Cassandra, the difference "timestamp - now" might
overflow the signed integer and the result is undefined - leading to the
undefined-behavior sanitizer to complain as reported in issue #8895.
Beyond the sanitizer, in practice, on my test setup, the timestamp -2^63+1
causes such overflow, which causes the above if() to make the nonsensical
statement that the timestamp is more than 3 days into the future.
This patch assumes that negative timestamps of any magnitude are still
allowed (as they are in Cassandra), and fixes the above if() to only
check timestamps which are in the future (timestamp > now).
We also add a cql-pytest test for negative timestamps, passing on both
Cassandra and Scylla (after this patch - it failed before, and also
reported sanitizer errors in the debug build).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210621141255.309485-1-nyh@scylladb.com>
This is the 2nd PR in a series with the goal of finishing the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). This one introduces a new implementation of the virtual tables, the streaming tables, which are suitable for large amounts of data.
This PR was created by @jul-stas and @StarostaGit
Closes #8961
* github.com:scylladb/scylla:
test/boost: run_mutation_source_tests on streaming virtual table
system_keyspace: Introduce describe_ring table as virtual_table
storage_service: Pass the reference down to system_keyspace
endpoint_details: store `_host` as `gms::inet_address`
queue_reader: implement next_partition()
virtual_tables: Introduce streaming_virtual_table
flat_mutation_reader: Add a new filtering reader factory method
We want to serialize snapshot application with command application
otherwise a command may be applied after a snapshot that already contains
the result of its application (it is not necessarily a problem since
raft by itself does not guarantee apply-once semantics, but better to
prevent it when possible). This also moves all interactions with the user's
state machine into one place.
Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>
"
The cql-server -> storage-service dependency comes from the server's
event_notifier which (un)subscribes on the lifecycle events that come
from the storage service. To break this link the same trick as with
migration manager notifications is used -- the notification engine
is split out of the storage service and then is pushed directly into
both -- the listeners (to (un)subscribe) and the storage service (to
notify).
tests: unit(dev), dtest(simple_boot_shutdown, dev)
manual({ start/stop,
with/without started transport,
nodetool enable-/disablebinary
} in various combinations, dev)
"
* 'br-remove-storage-service-from-transport' of https://github.com/xemul/scylla:
transport.controller: Brushup cql_server declarations
code: Remove storage-service header from irrelevant places
storage_service: Remove (unlifecycle) subscribe methods
transport: Use local notifier to (un)subscribe server
transport: Keep lifecycle notifier sharded reference
main: Use local lifecycle notifier to (un)subscribe listeners
main, tests: Push notifier through storage service
storage_service: Move notification core into dedicated class
storage_service: Split lifecycle notification code
transport, generic_server: Remove no longer used functionality
transport: (Un)Subscribe cql_server::event_notifier from controller
tests: Remove storage service from manual gossiper test
The controller code sits in the cql_transport namespace and
can omit mentioning it. Also the seastar::distributed<>
is replaced with the modern seastar::sharded<> while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some .cc files across the code include the storage service
for no real need. Drop the header and include (in some)
what's really needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the controller has the lifecycle notifier reference and
can stop using storage service to manage the subscription.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage proxy and sl-manager get subscribed on lifecycle
events with the help of storage service. Now when the notifier
lives in main() they can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's time to move the lifecycle notifier from storage
service to the main's scope. Next patches will remove the
$lifecycle-subscriber -> storage_service dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Introduce the endpoint_lifecycle_notifier class that's in
charge of keeping track of subscribers and notifying them.
The subscribers will thus be able to set and unset their
subscription without the need to mess with storage service
at all.
The storage_service for now keeps the notifier on board, but
this is going to change in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This prepares the ground for moving the notification engine
into own class like it was done for migration_notifier some
time ago.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After subscription management was moved onto controller level
a bunch of code can be dropped:
- passing migration notifier beyond controller
- event_notifier's _stopped bit
- event_notifier .stop() method
- event_notifier empty constructor and destructor
- generic_server's on_stop virtual method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a migration notifier that's carried through cql_server
_just_ to let event-notifier (un)subscribe on it. Also there's
a call for global storage-service in there which will need to
be replaced with yet another pass-through argument which is not
great.
It's easier to establish this subscription outside of cql_server
like it's currently done for proxy and sl-manager. In case of
cql_server the "outside" is the controller.
This patch just moves the subscription management from cql_server
to controller, next two patches will make more use of this change.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR contains the parts relevant to batchlog_manager stop in #8998 without adding a gate to the storage_proxy for synchronization with on-going queries in storage_proxy::drain_on_shutdown.
As explained in #9009, we see that the batchlog_manager isn't stopped if scylla shuts down during startup, e.g. when waiting for gossip to settle, since currently the batchlog_manager is stopped only from `storage_service::do_drain`, while `storage_service::drain_on_shutdown` deferred shutdown is installed only later on:
222ef17305/main.cc (L1419-L1421)
Fixes #9009
Test: unit(dev)
DTest: compact_storage_tests.py:TestCompactStorage.wide_row_test paging_test:TestPagingDatasetChanges.test_cell_TTL_expiry_during_paging update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_new_node_while_adding_info_{1,2}_test (dev)
Closes#9010
* github.com:scylladb/scylla:
main: add deferred stop of batchlog_manager
batchlog_manager: refactor drain out of stop
batchlog_manager: stop: break _sem on shard 0
batchlog_manager: stop: use abort_source to abort batchlog_replay_loop
batchlog_manager: do_batch_log_replay: hold _gate
* seastar 388ee307...93d053cd (5):
> doc: tutorial: document seastar::coroutine::all()
> doc: tutorial: nest "exceptions in coroutines" under "coroutines"
> coroutine: add a way of propagating exceptions without throwing
> input_stream: Fix read_exactly(n) incorrectly skipping data
> coroutines: introduce all() template for waiting for multiple futures
The class name `coroutine` became problematic since seastar
introduced it as a namespace for coroutine helpers.
To avoid a clash, the class from scylla is wrapped in a separate
namespace.
Without this patch, Seastar submodule update fails to compile.
Message-Id: <6cb91455a7ac3793bc78d161e2cb4174cf6a1606.1626949573.git.sarna@scylladb.com>
This is a rewrite of an old PR: #7582
`TOKEN()` restrictions don't work properly when a query uses an index.
For example this returns both rows:
```cql
CREATE TABLE t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ON t(v);
INSERT INTO t (pk, ck, v) VALUES (0, 0, 0);
INSERT INTO t (pk, ck, v) VALUES (1, 0, 0);
SELECT token(pk), pk, ck, v FROM t WHERE v = 0 AND token(pk) = token(0) ALLOW FILTERING;
```
This functionality is supported on both old and new indexes. In old
indexes the type of the token column was `blob`. This causes problems,
because `blob` representation of tokens is ordered differently. Tokens
represented as blobs are ordered like this:
```
0, 1, 2, 3, 4, 5, ..., bigint_max, bigint_min, ...., -5, -4, -3, -2, -1
```
Because of that clustering range for `token()` restrictions needs to be
translated to two clustering ranges on the `blob` column.
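The blob ordering above can be demonstrated with a small sketch (the encoding helper is an assumption about how a signed 64-bit token serializes into a lexicographically-compared `blob`):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Serialize a signed 64-bit token as 8 big-endian bytes -- a fixed-width
// encoding whose unsigned byte-wise comparison gives the `blob` ordering
// described above.
std::array<unsigned char, 8> token_to_blob(int64_t token) {
    std::array<unsigned char, 8> b{};
    uint64_t u = static_cast<uint64_t>(token);
    for (int i = 7; i >= 0; --i) {
        b[i] = u & 0xff;
        u >>= 8;
    }
    return b;
}

// Compare two tokens the way their blob encodings compare.
bool blob_less(int64_t a, int64_t b) {
    auto ba = token_to_blob(a);
    auto bb = token_to_blob(b);
    return std::memcmp(ba.data(), bb.data(), 8) < 0;
}
```

Because every negative token's encoding starts with a byte >= 0x80, a token range such as (-5, 5] must be split into two blob ranges: (blob(-5), blob(-1)] and [blob(0), blob(5)].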
To create old indexes disable the feature called:
`CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX` or run scylla version from branch
[`cvybhu/si-token2-old-index`](https://github.com/cvybhu/scylla/commits/si-token2-old-index)
I'm not sure if it's possible to create automatic tests with old
indexes. I ran `dev-test` manually on the `si-token2-old-index` branch,
and the only tests that failed were the ones testing row ordering. Rows
should be ordered by `token`, but because in old indexes the token is
represented as a `blob` this ordering breaks. This is a known issue
(#7443), that has been fixed by introducing new indexes.
To sum up:
* `token()` restrictions are fixed on both new and old indexes.
* When using old indexes, the rows are not properly ordered by token.
* With new indexes the rows are properly ordered by token.
Fixes #7043
Closes #9067
* github.com:scylladb/scylla:
tests: add secondary index tests with TOKEN clause
secondary_index_test: extract test data
secondary_index: Fix TOKEN() restrictions in indexed SELECTs
expression: Add replace_token function
The row_consumer interface has only one implementation:
mp_row_consumer_k_l; and we're not planning other ones,
so to reduce the number of inheritances, and the number
of lines in the sstable reader, these classes may be
combined.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
In preparation for the next patch combining row_consumer and
mp_row_consumer_k_l, move mp_row_consumer_k_l next to row_consumer.
Because row_consumer is going to be removed, we retire some
old tests for different implementations of the row_consumer
interface; as a result, we don't need to expose internal
types of kl sstable reader for tests, so all classes from
reader_impl.hh are moved to reader.cc and the reader_impl.hh
file is deleted; the resulting reader.cc has an analogous
structure to reader.cc in the sstables/mx directory.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The consumer_m interface has only one implementation:
mp_row_consumer_m; and we're not planning other ones,
so to reduce the number of inheritances, and the number
of lines in the sstable reader, these classes may be
combined.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
To make next patch combining consumer_m and mp_row_consumer_m
more readable, move mp_row_consumer_m next to consumer_m.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Add tests of SELECTs with TOKEN clauses on tables with secondary
indexes (both global and local).
test_select_with_token_range_cases checks all possible token range
combinations (inclusive/exclusive/infinity start/end) on tables without
index, with local or with global index.
test_select_with_token_range_filtering checks whether TOKEN restrictions
combined with column restrictions work properly. As different code paths
are taken if index is created on clustering key (first or non-first) or
non-primary-key column, the test checks scenarios where the index is created
on different columns.
Extract test data into separate variables, allowing it to be easily
reused by other tests. The tokens are hard-coded, because calculating
their value brought too much complexity to this code.
When using an index, restrictions like token(p) <= x were ignored.
Because of this a query like this would select all rows where r = 0:
SELECT * FROM tab WHERE r = 0 and token(p) > 0;
Adds proper handling of token restrictions to queries that use indexes.
Old indexes represented token as a blob, which complicates
clustering bounds. Special code is included, which translates
token clustering bounds to blob clustering bounds.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Today, table relies on row_cache::invalidate() serialization for
concurrent sstable list updates to produce correct results.
That's very error prone because table is relying on an implementation
detail of invalidate() to get things right.
Instead, let's make table itself take care of serialization on
concurrent updates.
To achieve that, sstable_list_builder is introduced. Only one
builder can be alive for a given table, so serialization is guaranteed
as long as the builder is kept alive throughout the update procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721001716.210281-1-raphaelsc@scylladb.com>
That function is dangerously used by the distributed loader, as the latter
was responsible for invalidating the cache for new sstables.
load_sstable() is an unsafe alternative to
add_sstable_and_update_cache() that should never have been used by
the outside world. Let's kill it and make the loader use
the safe alternative instead.
This will also make it easier to make sure that all concurrent updates
to sstable set are properly serialized.
Additionally, this may potentially reduce the amount of data evicted
from the cache, when the sstables being imported have a narrow range,
like high level sstables imported from a LCS table. Unlikely but
possible.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721131949.26899-1-raphaelsc@scylladb.com>
Also remove `parent_template_param` argument for
`handle_enum` and `handle_class` functions.
`setup_namespace_bindings` is renamed to `setup_additional_metadata`
since it now also sets parent template arguments for each object.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Rename `add_to_types` function to `register_local_writable_type`
and `local_types` set to `local_writable_types`.
Also rename other related functions accordingly, by adding
`writable` to the names.
Previous names were misleading since `local_types` refers not
to all local types but only to those which are marked with
`[[writable]]` attribute.
Nonetheless, we are going to need a mapping of all local types
to resolve type references from `BasicType` AST node
instances. So the `local_types` set is retained, but now it
corresponds to the list of all local types.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Previously the local types set contained items which are lists
of `[cls, parent_template_param]`. The second element is never
used, so remove it and move `cls` out of the list.
All uses are adjusted accordingly.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Do a pre-processing pass to cache namespaces info in each
type declaration AST node (`ClassDef` and `EnumDef`) and
store it in the `ns_context` field of a node.
Switch to `ns_context` and eliminate `namespaces` parameter
carried over through all writer functions.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch removes unused `parent_template_param` and `namespaces`
variables obtained from unpacking values from the `local_types` set.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Adds replace_token function which takes an expression
and replaces all left-hand-side occurrences of token()
with the given column definition.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When a leader moves to a follower state it aborts all requests that are
waiting on an admission semaphore with not_a_leader exception. But
currently it specifies itself as a new leader since abortion happens
before the fsm state changes to a follower. The patch fixes this by
destroying leader state after fsm state already changed to be a
follower.
Message-Id: <YPbI++0z5ZPV9pKb@scylladb.com>
* seastar ef320940...388ee307 (4):
> Merge 'Add a stall analyser tool' from Benny Halevy
> compat: implement coroutine_handle<void> for <experimental/coroutine> header
> Merge "Make app_template::run noexcept" from Pavel E
> perftune.py: make RPS CPU set to be a full CPU set
The stall analyser tool was requested by the SCT team to help make
sense of Scylla's stall reports and find more stall bugs!
Stop the batchlog manager using a deferred
action in main to make sure it is stopped after
its start() method has been called, even
if we bail out of main early due to an exception.
Change the bm.stop() calls in storage_service
to just stop the replay loop using drain().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
drain() aborts the replay loop fiber
and returns its future.
It grabs _gate so stop() will wait on it.
The intention is to call stop_replay_loop from
storage_service::decommission and do_drain rather
than stop, so we can stop the batchlog manager once,
using a deferred action in main.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Harden start/stop by using an abort_source to abort from
the replay loop.
Extract the loop into batchlog_replay_loop() coroutine,
with the _stop abort source as a stop condition,
plus use it for sleep_abortable to be able to promptly
stop while sleeping.
start() stores batchlog_replay_loop's future in a newly added
_started member, which is waited on in stop() to synchronize
with the start process at any stage.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can wait on do_batch_log_replay on stop().
Note that do_batch_log_replay is called both from
batchlog_replay_loop and from the storage_service.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In an upcoming commit I will add "system.describe_ring" table which uses
endpoint's inet address as a part of CK and, therefore, needs to keep them
sorted with `inet_addr_type::less`.
Introduced in d72b91053b.
If region was not compactible, for example because it has dense
segments, we would keep evicting even though the target for reclaimed
segments was met. In the worst case we may have to evict whole cache.
Refs #9038 (unlikely to be the cause though)
Message-Id: <20210720104039.463662-1-tgrabiec@scylladb.com>
Related issues: scylladb/sphinx-scylladb-theme#87
All the variables related to the multiversion extension are now defined in conf.py instead of using the GitHub Actions file.
How to test this PR:
Run `make multiversionpreview` in the docs folder. When you open https://0.0.0.0:5500, the browser should render the documentation site.
Closes #7957
lsa_buffer allocations are aligned to 4K. If a smaller size is
requested, the whole 4K is used. However, only the requested size was
used in accounting segment occupancy. This can confuse the reclaimer which may
think the segment is sparse while it is actually dense, and compacting
it will yield no or little gain. This can cause inefficient memory
reclamation or lack of progress.
Refs #9038
Message-Id: <20210720104110.463812-1-tgrabiec@scylladb.com>
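A sketch of the accounting fix (the constant and function names are illustrative): segment occupancy should charge the whole 4K-aligned allocation, not just the requested size:

```cpp
#include <cstddef>

constexpr size_t lsa_buffer_alignment = 4096; // illustrative constant

// Size a buffer allocation actually consumes in its segment: the
// requested size rounded up to the 4K alignment. Accounting only the
// requested size makes a dense segment look sparse to the reclaimer,
// so compacting it yields little or no gain.
size_t accounted_size(size_t requested) {
    return (requested + lsa_buffer_alignment - 1)
           / lsa_buffer_alignment * lsa_buffer_alignment;
}
```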
There's no need for extended `scylla_raft_dependencies`,
which includes the entire `scylla_core` target list.
Revert the tests which don't need the extended list to use
a minimal set of dependencies and switch to using
`scylla_core` as a dependency for
`raft_sys_table_storage_test` and `raft_address_map_test`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705104712.295499-1-pa.solodovnikov@scylladb.com>
struct permit_list exists so the intrusive list declaration which needs
the definition of reader_permit can be hidden in the .cc. But it turns
out that if the hook type is fully spelled out, the intrusive list
declaration doesn't need T to be defined. Exploit this to get rid of
this extra indirection.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210720073121.63027-2-bdenes@scylladb.com>
_free_space may be initialized with garbage so the kind() getter should
only look at the bit which corresponds to the kind. Misclassification
of a segment as being of a different kind may result in a hang during
segment compaction.
Surfaced in debug mode build where the field is filled with 0xbebebebe.
Introduced in b5ca0eb2a2.
Fixes #9057
Message-Id: <20210719232734.443964-1-tgrabiec@scylladb.com>
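The fix can be illustrated with a hypothetical descriptor layout (the real field layout in Scylla may differ): only the bit that encodes the kind is inspected, so leftover garbage such as the 0xbe debug poison cannot flip the classification:

```cpp
#include <cstdint>

enum class segment_kind : uint8_t { regular = 0, lsa_buffer = 1 };

// Hypothetical descriptor word: the lowest bit holds the kind, the
// remaining bits overlap other (possibly uninitialized) state.
segment_kind kind_of(uint32_t descriptor_word) {
    // Mask out everything but the kind bit; 0xbebebebe has a clear low
    // bit, so a poisoned word is still classified as `regular`.
    return static_cast<segment_kind>(descriptor_word & 1u);
}
```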
Add examples from issue #8991 to tests
Both of these tests pass on `cassandra 4.0` but fail on `scylla 4.4.3`
The first test checks that selecting values from an indexed table using only the clustering key returns correct values.
The second test checks that performing this operation requires filtering.
The filtering test looks similar to [the one for #7608](1924e8d2b6/test/cql-pytest/test_allow_filtering.py (L124)) but there are some differences - here the table has two clustering columns and an index, so it could test different code paths.
Contains a quick fix for the `needs_filtering()` function to make these tests pass.
It returns `true` for this case and the one described in #7708.
This implementation is a bit conservative - it might sometimes return `true` where filtering isn't actually needed, but at least it prevents scylla from returning incorrect results.
Fixes #8991.
Fixes #7708.
Closes #8994
* github.com:scylladb/scylla:
cql3: Fix need_filtering on indexed table
cql-pytest: Test selecting using only clustering key requires filtering
cql-pytest: Test selecting from indexed table using clustering key
The downgrade_to_v1 didn't reset the state of range tombstone assembler
in case of the calls to next_partition or fast_forward_to, which caused
a situation where the closing range tombstone change is cleared from the
buffer before being emitted, without notifying the assembler. This patch
fixes the behaviour in fast_forward_to as well.
Fixes #9022
Closes #9023
* github.com:scylladb/scylla:
flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler
flat_mutation_reader: introduce public method returning the default size of internal buffer.
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.
This is fixed by using new conditions in cases where
we are using an index.
Fixes #8991.
Fixes #7708.
For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The downgrade_to_v1 didn't reset the state of range tombstone assembler
in case of the calls to next_partition or fast_forward_to, which caused
a situation where the closing range tombstone change is cleared from the
buffer before being emitted, without notifying the assembler. This patch
fixes the behaviour in fast_forward_to as well.
Fixes #9022
When skipping bytes at the end of a continuous_data_consumer range,
the position of the consumer is moved after the skipped bytes, but
the position of the underlying input_stream is not.
This patch adds skipping of the underlying input_stream, to make
its position consistent with the position of the consumer.
Fixes #9024
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #9039
* github.com:scylladb/scylla:
tests: add test for skipping bytes at end of consumer
continuous_data_consumer: properly skip bytes at the end of a range
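A toy model of the bug (the types are illustrative, not the actual reader classes): the consumer and the underlying stream each track a position, and both must advance when bytes at the end of a range are skipped:

```cpp
#include <cstddef>

// Toy stand-ins for the input_stream and continuous_data_consumer positions.
struct stream_model {
    size_t pos = 0;
    void skip(size_t n) { pos += n; }
};

struct consumer_model {
    stream_model stream;
    size_t pos = 0;

    // Fixed behaviour: skipping trailing bytes advances both positions,
    // so the next read from the stream starts where the consumer thinks
    // it is. The bug advanced only `pos`, leaving the stream behind.
    void skip_trailing(size_t n) {
        pos += n;
        stream.skip(n);
    }
};
```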
The semaphore accepts a functor in its constructor which is run just
before throwing on wait queue overload. This is used exclusively to bump
a counter in the database::stats, which counts queue overloads. However,
there is now an identical counter in
reader_concurrency_semaphore::stats, so the database can just use that
directly and we can retire the now unused prethrow action.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210716111105.237492-1-bdenes@scylladb.com>
The new test confirms that the regression, where
we didn't correctly skip bytes at the end of a
continuous_data_consumer range, is fixed.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
When skipping bytes at the end of a continuous_data_consumer range,
the position of the consumer is moved after the skipped bytes, but
the position of the underlying input_stream is not.
This patch adds skipping of the underlying input_stream, to make
its position consistent with the position of the consumer.
Fixes #9024
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
... when decommissioned (reworked)' from Eliran Sinvani
This is a rework of #8916 The polling loop of the service level
controller queries a distributed table in order to detect configuration
changes. If a node gets decommissioned, this loop continues to run until
shutdown, if a node stays in the decommissioned mode without being shut
down, the loop will fail to query the table and this will result in
warnings and eventually errors in the log. This is not really harmful
but it adds unnecessary noise to the log. The series below lays the
infrastructure for observing storage service state changes, which
is eventually used to break the loop upon preparation for
decommissioning. Tests: unit test (dev). Failing tests in jenkins.
Fixes #8836
The previous merge (possibly due to conflict resolution) contained a
misplaced get that caused an abort on shutdown.
Closes #9035
* github.com:scylladb/scylla:
Service Level Controller: Stop configuration polling loop upon leaving the cluster
main: Stop using get_local_storage_service in main
We need a permit to initialize said object, which makes the semaphore
used and hence triggers an error if an exception is thrown in the
constructor. Move the initialization to the end of the constructor to
prevent this.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210719040449.9202-1-bdenes@scylladb.com>
Adds test that creates a table with primary key (p, c1, c2)
with a global index on c2 and then selects where c1 = 1 and c2 = 1.
This should require filtering, but doesn't.
Refs #8991.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Adds test that creates a table with primary key (p, c1, c2)
with a global index on c2 and then selects where c1 = 1 and c2 = 1.
This currently fails.
Refs #8991.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Seastar's default limit of 10,000 iocbs per shard is too low for
some workloads (it places an upper bound on the number of idle
connections, above which a crash occurs). Use the new Seastar
feature to raise the default to 50000.
Also multiply the global reservation by 5, and round it upwards
so the number is less weird. This prevents io_setup() from failing.
For tests, the reservation is reduced since they don't create large
numbers of connections. This reduces surprise test failures when they
are run on machines that haven't been adjusted.
Fixes #9051
Closes #9052
In order to avoid data loss bugs that could arise from lack of
serialization when using the preemptable build_new_sstable_list(),
let's document the serialization requirement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210714201301.188622-1-raphaelsc@scylladb.com>
"
The debug-mode tests nowadays take ~1 hour to complete on a
24-cores threadripper machine. This is mostly because of a bunch
of individual test cases that run sequentially (since they sit
in one test) each taking half-an-hour and longer.
The previous attempt was to break the longest tests into pieces,
and to update the list of long-running test in suite.yaml file,
but the concern was that the linkage time and disk space would
grow without limits if this continues. Also the long-running tests
list needs to be revisited every so often.
So the new attempt is to resurrect Avi's patch that ran test
cases in parallel for boost tests. This set applies parallelism
to all tests and allows blacklisting those that shouldn't run in
parallel (the logalloc test needs its very first case to
prime_segment_pools so that other cases run smoothly, thus it
cannot be parallelized).
Although this wild parallelism adds an overhead for _each_ test
case, this is good enough even for short dev-mode tests (it saves
25% of runtime), and it greatly relaxes the maintenance of the
"parallelizable list of tests".
For debug tests the problem is not 100% solved. There are 6 cases
that run longer than 30min, while all the others complete much,
much faster. So, excluding those 6 slow cases, the full parallel
run saves 50+% of the runtime -- 60+m now vs 25m with the patch.
Those 6 slowest cases will need more incremental care.
The --parallel-cases mode is not yet default, because it requires
larger max-aio-nr value to be set, which is not (yet?) automatic.
Also it sometimes hits nr-open-files limit, which also needs more
work.
tests: unit(dev), unit(debug)
"
* 'br-parallel-testpy-3' of https://github.com/xemul/scylla:
tests: Update boost long tests list
test.py: Parallelize test-cases run (for boost tests)
test.py: Prepare BoostTest for running individual cases
test.py: Prepare TestSuite::create_test() for parallelism
test.py: Treat shortname as composite
test.py: Reformat tabular output
The memtable_list::flush() maintains a shared_promise object
to coalesce the flushers until the get_flush_permit() resolves.
Also it needs to keep the extraneous flushes counter bumped
while doing the flush itself.
All this can be coded in a shorter form and without the need
to carry shared_promise<> around.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210716164237.10993-1-xemul@scylladb.com>
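The coalescing pattern described above can be sketched in Python asyncio (purely illustrative; `MemtableList`, `_do_flush` and the permit stand-in are hypothetical names for this sketch, not Scylla's actual API):

```python
import asyncio

class MemtableList:
    """Illustrative sketch: concurrent flush() calls are coalesced onto
    one shared future until the in-flight flush completes, and the
    extraneous-flushes counter is bumped for each coalesced caller."""

    def __init__(self):
        self._pending = None          # shared future of the in-flight flush
        self.extraneous_flushes = 0
        self.flush_count = 0

    async def _do_flush(self):
        # stand-in for get_flush_permit() + the actual flush work
        await asyncio.sleep(0)
        self.flush_count += 1

    def flush(self):
        if self._pending is None:
            # first caller: start the real flush and publish its future
            self._pending = asyncio.ensure_future(self._run())
        else:
            # a flush is already in flight: piggyback on it
            self.extraneous_flushes += 1
        return self._pending

    async def _run(self):
        try:
            await self._do_flush()
        finally:
            self._pending = None
```

All callers awaiting the returned future resolve when the single underlying flush finishes, which is the effect the shorter form of memtable_list::flush() keeps without carrying a shared_promise<> around.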
The parallelism is achieved by listing the content of each (boost)
test and adding a test entry for each case found, appending the
'--run_test={case_name}' option.
Also, a few tests (logalloc and memtable) have cases that depend on
each other (the former explicitly states this in its head comment),
so these are marked as "no_parallel_cases" in the suite.yaml file.
In dev mode tests need 2m 5s to run by default. With parallelism
(and updated long-running tests list) -- 1m 35s.
In debug mode there are 6 slow _cases_ that overrun 30 minutes.
They finish last and deserve some special (incremental) care. All
the other tests run ~1h by default vs ~25m in parallel.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
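The per-case expansion described above can be sketched as follows (a simplified illustration; `per_case_args` and its parameters are invented for this sketch and are not test.py's actual interface):

```python
def per_case_args(binary, base_args, case_names, no_parallel_cases=False):
    """Sketch of test.py's per-case expansion: one argv per boost test
    case, selected with --run_test=<case>.  Tests whose cases depend on
    each other (marked no_parallel_cases in suite.yaml) keep one argv
    that runs the whole binary."""
    if no_parallel_cases or not case_names:
        return [[binary, *base_args]]
    return [[binary, *base_args, f"--run_test={case}"] for case in case_names]
```

Each resulting argv can then be scheduled as an independent test entry, which is what lets the cases run in parallel.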
This means adding the casename argument to its describing class
and handling it:
1. appending to the shortname
2. adding the --run_test= argument to boost args
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is in charge of creating a single
entry in the list of tests to be run. The BoostTestSuite's
method is about to create several entries and this patch
prepares it for this:
- makes it distinguish individual arguments
- lets it select the test.id value itself
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When running tests in parallel-cases mode the test.uname must
include the case name to make different log and xml files for
different runs and to show which exact case is run when shown
by the tabular-output. At the same time the test shortname
identifies the binary with the whole test.
This patch makes class Test treat the shortname argument as
a dot-separated string where the 0th component is the binary
with the test and the rest is how test identifies itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change solves several issues that would arise with the
case-by-case run.
First, the currently printed name is "$binary_name.$id". For a
case-by-case run the binary name would coincide for many cases
and it would be inconvenient to identify the test case. So
the test's uname is printed instead.
Second, the test's uname doesn't contain the suite name (unlike the
test binary name, which does), so this patch also adds the
explicit suite name back as a separate column (like MODE)
Third, the testname + casename string length will be far above
the allocated 50 characters, so the test name is moved to the
tail of the line.
Fourth, the total number of cases is 2100+, the field of 7
characters is not enough to print it, so it's extended.
Finally the test.py output would look like this for parallel run:
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/2108] raft dev [ PASS ] etcd_test.test_progress_leader.40 0.06s
[2/2108] raft dev [ PASS ] etcd_test.test_vote_from_any_state.45 0.03s
[3/2108] raft dev [ PASS ] etcd_test.test_progress_flow_control.43 0.04s
[4/2108] raft dev [ PASS ] etcd_test.test_progress_resume_by_append_resp.41 0.05s
[5/2108] raft dev [ PASS ] etcd_test.test_leader_election_overwrite_newer_logs.44 0.04s
[6/2108] raft dev [ PASS ] etcd_test.test_progress_paused.42 0.05s
[7/2108] raft dev [ PASS ] etcd_test.test_log_replication_2.47 0.06s
...
or like this for regular:
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/184] raft dev [ PASS ] fsm_test.41 0.06s
[2/184] raft dev [ PASS ] etcd_test.40 0.06s
[3/184] cql dev [ PASS ] cassandra_cql_test.2 1.87s
[4/184] unit dev [ PASS ] btree_stress_test.30 1.82s
...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A couple of improvements to prepare for the next patchset.
We move `logical_timer` and `ticker` to their own headers due to the
generality of these data structures. They are not very specific to the
test.
`logical_timer` is extended with a `schedule` function, allowing one to
schedule any given function to be called at a given time point.
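A minimal sketch of what such a `schedule` operation on a logical timer looks like (Python, illustrative only; the real `logical_timer` is C++ and differs in detail):

```python
import heapq

class LogicalTimer:
    """Sketch of a logical-time timer with a schedule() operation:
    callbacks fire when tick() advances logical time past their
    deadline."""

    def __init__(self):
        self._now = 0
        self._scheduled = []   # min-heap of (deadline, seq, fn)
        self._seq = 0          # tie-breaker so fns are never compared

    def now(self):
        return self._now

    def schedule(self, deadline, fn):
        heapq.heappush(self._scheduled, (deadline, self._seq, fn))
        self._seq += 1

    def tick(self):
        self._now += 1
        while self._scheduled and self._scheduled[0][0] <= self._now:
            _, _, fn = heapq.heappop(self._scheduled)
            fn()
```

Timeouts (`with_timeout`) can then be built on top of `schedule` by scheduling a callback that breaks a pending promise at the deadline.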
The interface of `network` in `randomized_nemesis_test` is extended by
`add_grudge` and `remove_grudge` functions for implementing network
partitioning nemeses.
Furthermore `network` can be now constructed with an arbitrary network
delay, which was previously hardcoded.
`with_env_and_ticker` is now generic w.r.t. return values (previously
`future<>` was assumed).
`environment` exposes a reference to the `network` through a getter.
The `not_a_leader` exception now shows the leader's ID in the exception
message. Useful for logging.
In `logical_timer::with_timeout`, when we timeout, we don't just return
`timed_out_error`. The returned exception now actually contains the
original future... well almost; in any case, the user can now do
something different to the future other than simply discarding it.
We also fix some `broken_promise` exceptions appearing in discarded
futures in certain scenarios. See the corresponding commit for detailed
explanation.
We handle `raft::dropped_entry` in the `call` function.
`persistence` is fixed to avoid creating gaps in the log when storing
snapshots and to support complex state types.
Waiting for leader was refactored into a separate function and
generalized (we wait for a set of nodes to elect a leader instead of a
single node to elect itself) to be useful in more situations.
Finally, we introduce `reconfigure`, a higher-level version of
`set_configuration` which performs error handling and supports timeouts.
* kbr/raft-nemesis-improvements-v4:
test: raft: randomized_nemesis_test: `reconfigure` function
test: raft: randomized_nemesis_test: refactor waiting for leader into a separate function
test: raft: randomized_nemesis_test: persistence: avoid creating gaps in the log when storing snapshots
test: raft: randomized_nemesis_test: persistence: handle complex state types
test: raft: randomized_nemesis_test: `call`: handle `raft::dropped_entry`
test: raft: randomized_nemesis_test: impure_state_machine/call: handle dropped channels
test: raft: randomized_nemesis_test: environment: expose the network
test: raft: randomized_nemesis_test: configurable network delay and FD convict threshold
test: raft: randomized_nemesis_test: generalize `with_env_and_ticker`
test: raft: randomized_nemesis_test: network: `add_grudge`, `remove_grudge` functions
test: raft: randomized_nemesis_test: move `ticker` to its own header
test: raft: randomized_nemesis_test: ticker: take `logger` as a constructor parameter
test: raft: logical_timer: handle immediate timeout
test: raft: logical_timer: on timeout, return the original future in the exception
test: raft: logical_timer: add `schedule` member function
test: raft: randomized_nemesis_test: move `logical_timer` to its own header
test: raft: include the leader's ID in the `not_a_leader` exception's message
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong - this limit is configurable for users,
and, what's more important, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.
The series comes with a cql-pytest test that failed before the fix and succeeds now. This bug is also covered by the `wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view` dtest, but that needs over a minute to run, as opposed to cql-pytest's <1 second.
Fixes #9047
Tests: unit(release), dtest(wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view)
Closes #9048
* github.com:scylladb/scylla:
cql-pytest: add a materialized views suite with first cases
db,view: drop the artificial limit on view update mutation size
cql-pytest did not have a suite for materialized views, so one is
created. At the same time, test cases for building/updating a view on
a base table with large cells are added as a regression test for #9047.
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong - this limit is configurable for users,
and, what's more important, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.
Tests: unit(release), dtest(wide_rows_test)
Currently, flat_mutation_reader_v2::is_end_of_stream() returns
flat_mutation_reader_v2::impl::_end_of_stream, which means the producer
is done. The stream may still not be fully consumed even if the
producer is done, due to internal buffering. So consumers need to make
a more elaborate check:
rd.is_end_of_stream() && rd.is_buffer_empty()
It would be cleaner if flat_mutation_reader_v2::is_end_of_stream()
returned the state of the consumer-side of the stream, since it
belongs to the consumer-side of the API. The consumption will be as
simple as:
while (!rd.is_end_of_stream()) {
consume_fragment(rd());
}
This patch makes the change on the v2 of the reader interface. v1 is
not changed to avoid problems which could happen when backporting code
which assumes new semantics into a version with the old semantics. v2
is not in any old branch yet so it doesn't have this problem and it's
a good time to make the API change.
Note that it's always safe to use the new semantics in the context
which assumes the old semantics, so v1 users can be safely converted
to v2 even if they are unaware of the change.
Fixes #3067
Message-Id: <20210715102833.146914-1-tgrabiec@scylladb.com>
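The semantic difference can be sketched in Python (illustrative only; the real interface is flat_mutation_reader_v2 in C++, and the method names here only loosely mirror it):

```python
class Reader:
    """Sketch contrasting the two is_end_of_stream() semantics for a
    buffered reader: producer-side (v1) vs consumer-side (v2)."""

    def __init__(self, fragments):
        self._buffer = list(fragments)  # internal buffering
        self._producer_done = True      # producer filled everything up-front

    def is_buffer_empty(self):
        return not self._buffer

    # v1-style: reflects only that the producer is done
    def producer_end_of_stream(self):
        return self._producer_done

    # v2-style: reflects the consumer side of the stream
    def is_end_of_stream(self):
        return self._producer_done and self.is_buffer_empty()

    def pop(self):
        return self._buffer.pop(0)
```

With the v2-style semantics the consumer loop needs no separate is_buffer_empty() check, matching the simpler consumption loop shown above.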
Refs #8418
Broadcast can (apparently) be an address not actually on the machine, but
on the other side of a NAT. Thus binding the local side of an outgoing
connection there will fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast);
this will require routing + NAT so that, to the node being connected to,
the connection appears to come from the broadcast address, allowing the
connection (if using partial encryption).
Note: this has been verified only in a limited way. I would suggest
verifying various multi rack/dc setups before relying on it.
Closes #8974
Reads which bypass cache will use a private temporary instance of cached_file
which dies together with the index cursor.
The cursor still needs a cached_file with a caching layer. Binary searching needs
caching for performance, since some of the pages will be reused. Another reason to still
use cached_file is to work with a common interface, and reusing it requires minimal changes.
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.
Promoted index scanner (ka/la format) will not go through the page cache.
Since fce124bd90 ("Merge "Introduce flat_mutation_reader_v2" from
Tomasz") tests involving mutation_reader are a lot slower due to
the new API testing. On slower machines this is enough to cause timeouts.
Work is underway to improve the situation, and it will also revert back
to the original timing once the flat_mutation_reader_v2 work is done,
but meanwhile, increase the timeout.
Closes #9046
This patch applies the same changes to both kl and mx sstable readers, but because the kl reader is old, we'll focus on the newer one.
This patch makes the main sstable reader process a coroutine,
allowing us to simplify it by:
- using the state saved in the coroutine instead of most of the states saved in the _state variable
- removing the switch statement and moving the code of the former switch cases, resulting in a reduced number of jumps in the code
- removing repetitive ifs for read statuses, by adding them to the coroutine implementation
The coroutine is saved in a new class `processing_result_generator`, which works like a generator: using its `generate()` method, one can order the coroutine to continue until it yields a data_consumer::processing_result value, which was achieved previously by calling the function that is now the coroutine (`do_process_state()`).
Before the patch, the main processing method had 558 lines. The patch reduces this number to 345 lines.
However, usage of c++ coroutines has a non-negligible effect on the performance of the sstable reader.
In the test cases from `perf_fast_forward` the new sstable reader performs up to 2% more instructions (per fragment) than the former implementation, and this regression occurs in cases where we're reading many subsequent rows, without any skips.
Thanks to finding an optimization during the development of the patch, the loss is mitigated when we do skip rows, and for some cases, we can even observe an improvement.
You can see the full results in attached files: [old_results.txt](https://github.com/scylladb/scylla/files/6793139/old_results.txt), [new_results.txt](https://github.com/scylladb/scylla/files/6793140/new_results.txt)
Test: unit(dev)
Refs: #7952
Closes #9002
* github.com:scylladb/scylla:
mx sstable reader: reduce code blocks
mx sstable reader: make ifs consistent
sstable readers: make awaiter for read status
mx sstable reader: don't yield if the data buffer is not empty
mx sstable reader: combine FLAGS and FLAGS_2 states
mx sstable reader: reduce placeholder state usage
mx sstable reader: replace non_consuming states with a bool
mx sstable reader: reduce placeholder state usage
mx sstable reader: replace unnecessary states with a placeholder
mx sstable reader: remove false if case
mx sstable reader: remove row_body_missing_columns_label
mx sstable reader: remove row_body_deletion_label
mx sstable reader: remove column_end_label
mx sstable reader: remove column_cell_path_label
mx sstable reader: remove column_ttl_label
mx sstable reader: remove column_deletion_time_label
mx sstable reader: remove complex_column_2_label
mx sstable reader: remove row_body_missing_columns_read_columns_label
mx sstable reader: remove row_body_marker_label
mx sstable reader: remove row_body_shadowable_deletion_label
mx sstable reader: remove row_body_prev_size_label
mx sstable reader: remove ck_block_label
mx sstable reader: remove ck_block2_label
mx sstable reader: remove clustering_row_label and complex_column_label
mx sstable reader: remove labels with only one goto
mx sstable reader: replace the switch cases with gotos and a new label
mx sstable reader: remove states only reached consecutively or from goto
mx sstable reader: remove switch breaks for consecutive states
mx sstable reader: convert readers main method into a coroutine
kl sstable reader: replace states for ending with one state, simplify non_consuming
kl sstable reader: remove unnecessary states
kl sstable reader: remove unnecessary yield
kl sstable reader: remove unnecessary blocks
kl sstable reader: fix indentation
kl sstable reader: replace switch with standard flow control
kl sstable reader: remove state::CELL case
kl sstable reader: move states code only reachable from one place
kl sstable reader: remove states only reached consecutively
kl sstable reader: remove switch breaks for consecutive states
kl sstable reader: remove unreachable case
kl sstable reader: move testing hack for fragmented buffers outside the coroutine
kl sstable reader: convert readers main method into a coroutine
sstable readers: create a generator class for coroutines
Some blocks of code were surrounded by curly braces, because
a variable was declared inside a switch case. After changes,
some of the variable declarations are in if/else/while cases,
and no longer need to be in separate code blocks, while other
blocks can be extended to entire labels for simplicity.
In several places we're checking the return value of our
consumers' consume_* calls. Because the behaviour in all cases
is the same, let us use the same notation as well.
After each read* call of the primitive_consumer we need to check
if the entire primitive was in our current buffer. We can check it
in the proceed_generator object by yielding the returned read status:
if the yielded status is ready, the yield_value method returns
a structure whose await_ready() method returns true. Otherwise it
returns false.
The returned structure is co_awaited by the coroutine (due to co_yield),
and if await_ready() returns true, the coroutine isn't stopped;
conversely, if it returns false (technically: and because its await_suspend
method returns void), the coroutine stops, and a proceed::yes value
is saved, indicating that we need more buffers.
The skip() method returns a skip_bytes object if we want to
skip the entire buffer, otherwise it returns a proceed::yes
and trims the buffer.
If the buffer is only trimmed we don't need to interrupt
the coroutine, we simply continue instead.
The non_consuming() method is only used after checking
primitive_consumer::active() (in continuous_data_consumer::process()),
so we don't need separate states for the cases where
primitive_consumer::active() holds, which is most of them.
We still need to make sure that the states change when they need to,
so we replace all the concerned states with the placeholder state,
and for the few states in the non_consuming() disjunction where
primitive_consumer::active() returns true, we set the value of
_consuming to false, changing it back when the state is no longer
non_consuming.
We can remove state assignments that we know are
changing a state to itself.
Similarly, if a state is changed in the same way
in an if and an else, it can be changed before the
if/else instead.
After removing the switch, the state is only used for
verify_end_state() and non_consuming(), so we can
replace states that are not used there with a single
one, so that the state still stops being one of the
appearing states when it needs to.
consume_row_marker_and_tombstone does not return proceed::no in the
mp_row_consumer_m implementation, and even if it did, we would most
likely want to yield proceed::no in that case as well.
column_cell_path_label is only reached from two gotos, both
at the end of an if/else block, or consecutively, so the code
after the if/else block can be omitted by an if instead (or else).
column_ttl_label is only reached from two gotos, both
at the end of an if/else block, or consecutively, so the code
after the if/else block can be omitted by an if instead (or else).
row_body_missing_columns_read_columns_label is only reached
consecutively, or from a goto after the label. This is changed to a
while loop starting at the label and ending at the goto.
The code executed in the only case we do not reach the goto (so
when exiting the loop) is moved after the while.
row_body_marker_label is only reached from one goto inside an else
case, or consecutively, so the code omitted by goto can be moved
inside the corresponding if case.
row_body_shadowable_deletion_label is only reached from one
goto, or consecutively, so the code omitted by goto can be
omitted by an if instead (or else).
row_body_prev_size_label is only reached consecutively, or from
a goto not far after the label. This is changed to a while loop
starting at the label and ending at the goto.
ck_block_label is only reached consecutively, or from
a few gotos not far after the label. This is changed
to a while loop with gotos replaced with continue's.
clustering_row_label is only reached from one goto, or consecutively,
so the code omitted by goto can be omitted by an if instead (or else).
Also remove complex_column_label because it is next to
its only goto.
Because the number of remaining cases is moderately low, and
after finishing a case we always enter another one, the switch
is removed completely, and the last remaining cases are handled
by 3 additional gotos and 1 new label.
If a state is never reached from the top of the switch, but only
by continuing from the previous case, we don't need to have a case:
for it.
Similarly, if there is a label that we goto, we don't need the
switch case.
(same as in kl sstable reader)
The function is converted to a coroutine simply by adding an
infinite loop around the switch, and starting another iteration
after yielding a value, instead of returning.
Because the coroutine resume() function does not take any arguments,
a new member is introduced to remember the "data" buffer, that was
previously an argument to the method.
After removing the switch, the only use for states in the sstable reader
are methods non_consuming() and verify_end_state().
The non_consuming() method is only used after ensuring that
!primitive_consumer::active() (in continuous_data_consumer::process()),
so for this method we don't need the states where
primitive_consumer::active() holds -- which is actually all of them.
We don't differentiate between ATOM_START and ATOM_START_2 in
verify_end_state(), so we can just merge them into one.
While we need to remember the times when we enter states used in verify_end_state(),
we also need to remember when we exit them. For that reason we introduce a new
state "NOT_CLOSING", that fails all comparisons in verify_end_state() and
replaces all states that aren't used in verify_end_state().
After removing the switch, the state is only used for
verify_end_state() and non_consuming(), so we can
remove states that are not used there (and which do
not change them).
Some blocks of code were surrounded by curly braces, because
a variable was declared inside a switch case. With standard
flow control, it's no longer needed.
We get rid of the switch by using the infinite loop around the
switch for jumping to the first case, adding an infinite loop
around the second case (one break from the switch with the
state of the first case becomes a break of the new while),
and adding an if around the first case (because we never break
in the first case).
The function is converted to a coroutine simply by adding an
infinite loop around the switch, and starting another iteration
after yielding a value, instead of returning.
Because the coroutine resume() function does not take any arguments,
a new member is introduced to remember the "data" buffer, that was
previously an argument to the method.
The data_consume_rows_context and data_consume_rows_context_m are
classes that use primitive_consumer read* methods to get primitives
from a streamed sstable and, using their corresponding consumers'
(mp_row_consumer_k_l and mp_row_consumer_m) consume* methods,
fill the buffer of the corresponding flat_mutation_reader.
The main procedure where we decide which read* and consume* methods
to call is do_process_state. We save the current state of the
procedure in the _state variable, to remember where to continue in
the next call. For each call, the do_process_state method returns
information about whether we can keep filling the buffer using
more buffers from the stream (proceed::yes), or not (proceed::no).
The saved state can be (mostly) removed by using a generator
coroutine, whose state is saved when its execution is suspended,
and which yields the values that do_process_state would have
returned before.
The processing_result_generator is a class for managing a generator
coroutine. When the coroutine halts, the proceed_generator saves the
value yielded by the coroutine, and returns it to the caller.
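The generator idea can be sketched in Python, where generators give the same "resume where we yielded" behavior (the names only loosely mirror the patch; this is not the C++ implementation):

```python
class ProcessingResultGenerator:
    """Sketch of processing_result_generator: drives a generator
    coroutine and returns each value it yields to the caller."""

    def __init__(self, coro):
        self._coro = coro

    def generate(self):
        # resume the coroutine until it yields its next processing result
        return next(self._coro)

def do_process_state(fields):
    """The former state machine as a coroutine: the suspension point
    itself remembers where to continue, so no _state variable is
    needed; each yield stands for what do_process_state() used to
    return (proceed::yes / proceed::no)."""
    for name in fields:
        yield f"read {name}"     # suspend: e.g. we need more buffers
    yield "row done"             # suspend: e.g. the consumer's buffer is full
```

Each `generate()` call picks up exactly where the previous one left off, which is the property the patch uses to delete most of the explicit `_state` bookkeeping.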
This patchset combines two important changes to the way reader permits
are created and admitted:
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) Currently permits are created before the read is started, but they
only wait for admission when going to the disk. This leaves the
resources consumption of cache and memtables reads unbounded, possibly
leading to OOM (rare but happens). This series changes this that permits
are admitted at the moment they are creating making admission up-front
-- at least those reads that pass admission at all (some don't).
(2) Admission currently is based on availability of resources. We have a
certain amount of memory available, which is derived from the memory
available to the shard, as well as a hardcoded count resource. Reads are
admitted when a count and a certain amount (base cost) of memory is
available. This patchset adds a new aspect to this admission process
beyond the existing resource availability: the number of used/blocked
reads. Namely it only admits new reads if in addition to the necessary
amount of resources being available, all currently used readers are
blocked. In other words we only admit new reads if all currently
admitted reads require something other than CPU to progress. They are
either waiting on I/O, a remote shard, or attention from their consumers
(not used currently).
The reason for making these two changes at the same time is that
up-front admission means cache reads now need to obtain a permit too.
For cache reads the optimal concurrency is 1. Anything above that just
increases latency (without increasing throughput). So we want to make sure
that if a cache reader hits it doesn't get any competition for CPU and
it can run to completion. We admit new reads only if the read misses and
has to go to disk.
A side effect of these changes is that the execution stages from the
replica-side read path are replaced with the reader concurrency
semaphore as an execution stage. This is necessary due to bad
interaction between said execution stages and up-front admission. This
has an important consequence: read timeouts are more strictly enforced
because the execution stage doesn't have a timeout so it can execute
already timed-out reads too. This is not the case with the semaphore's
queue which will drop timed-out reads. Another consequence is that now
data and mutation reads share the same execution stage, which increases
its effectiveness, on the other hand system and user reads don't
anymore.
Fixes: #4758
Fixes: #5718
Tests: unit(dev, release, debug)
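The admission rule described in (2) can be sketched as follows (Python, illustrative only; the real reader_concurrency_semaphore tracks more resources and states than this):

```python
class ReaderConcurrencySemaphore:
    """Sketch of the admission rule described above: admit a new read
    only if enough memory is available AND every currently-used read is
    blocked (waiting on I/O or a remote shard rather than on CPU)."""

    def __init__(self, memory):
        self.available_memory = memory
        self.used = 0      # admitted reads currently in progress
        self.blocked = 0   # of those, reads waiting on something non-CPU

    def can_admit(self, cost):
        return self.available_memory >= cost and self.used == self.blocked

    def admit(self, cost):
        if not self.can_admit(cost):
            return False
        self.available_memory -= cost
        self.used += 1
        return True
```

A read that hits in cache stays "used" and unblocked, so no competing read is admitted until it finishes or blocks on disk, which is exactly the concurrency-of-1 behavior wanted for cache reads.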
* 'reader-concurrency-semaphore-in-front-of-the-cache/v5.3' of https://github.com/denesb/scylla: (54 commits)
test/boost/reader_concurrency_semaphore_test: add used/blocked test
test/boost/reader_concurrency_semaphore_test: add admission test
reader_permit: add operator<< for reader_resources
reader_concurrency_semaphore: add reads_{admitted,enqueued} stats
table: make_sstable_reader(): fix indentation
table: clean up make_sstable_reader()
database: remove now unused query execution stages
mutation_reader: remove now unused restricting_reader
sstables: sstable_set: remove now unused make_restricted_range_sstable_reader()
reader_permit: remove now unused wait_admission()
reader_concurrency_semaphore: remove now unused obtain_permit_nowait()
reader_concurrency_semaphore: admission: flip the switch
database: increase semaphore max queue size
test: index_with_paging_test: increase semaphore's queue size
reader_concurrency_semaphore: add set_max_queue_size()
test: mutation_reader_test: remove restricted reader tests
reader_concurrency_semaphore: remove now unused make_permit()
test: reader_concurrency_semaphore_test: move away from make_permit()
test: move away from make_permit()
treewide: use make_tracking_only_permit()
...
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this, admission is now waited on when creating the permit.
(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
require something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).
We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.
Another change made to accommodate this switch is the replacement of the
replica-side read execution stages with the reader concurrency
semaphore acting as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other anymore. One read being executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
Queued reads haven't taken 10KB (not even 1KB) for years now. But the real
motivation of this patch is that due to a soon-to-come change to
admission we expect larger queues especially in tests, so be more
forgiving with queue sizes.
Soon we will switch to up-front admission which will break these tests.
No point in trying to fix them as once the switch is done we'll retire
the restricted reader too. Remove these tests now so they are not in the
way of progress.
When the generate-and-propagate-view-updates routine was rewritten
to allow partial results, one important validation got lost:
previously, an error which occurred during update *generation*
was propagated to the user - as an example, the indexed column
value must be smaller than 64kB, otherwise it cannot act as primary
key part in the underlying view. Errors on view update *propagation*
are however ignored in this layer, because it becomes a background
process.
During the rewrite these two got mixed up and so it was possible
to ignore an error that should have been propagated.
This behavior is now fixed.
Fixes #9013
Closes #9021
* github.com:scylladb/scylla:
cql-pytest: add a case for too large value in SI
table: stop ignoring view generation errors on write path
This series includes a comprehensive test suite for the DynamoDB API's
TTL (item expiration) feature described in issue #5060. Because we have
not yet implemented the TTL feature in Alternator, all of the tests
still xfail, but they all pass on DynamoDB and demonstrate exactly how
the TTL feature works and how it interacts with other features such as
LSI, GSI and Streams. The patch which introduces these tests is heavily
commented to explain exactly what it tests, and why.
Because DynamoDB only expires items some 10-30 minutes after their
expiration time (the documentation even suggests it can be delayed by 24
hours!), some of these tests are extremely long (up to 30 minutes!), so
we also introduce in this series a new marker for "verylong" tests.
verylong tests are skipped by default, unless the "--runverylong" option
is given. In the future, when we implement the TTL feature in Alternator
and start testing it, we may be able to configure it with a much shorter
expiration timeout and then we might be able to run these tests in a
reasonable time and make them run by default.
Closes #8564
* github.com:scylladb/scylla:
test/alternator: add tests for the Alternator TTL feature
test/alternator: add marker for "veryslow" tests
test/alternator: add new_test_table() utility function
Instead of passing down the querier_cache_ctx to table::mutation_query(),
handle the querier lookup/save on the level where the cache exists.
The real motivation behind this change however is that we need to move
the lookup outside the execution stage, because the current execution
stage will soon be replaced by the one provided by the semaphore and to
use that properly we need to know if we have a saved permit or not.
Instead of passing down the querier_cache_ctx to table::query(),
handle the querier lookup/save on the level where the cache exists.
The real motivation behind this change however is that we need to move
the lookup outside the execution stage, because the current execution
stage will soon be replaced by the one provided by the semaphore and to
use that properly we need to know if we have a saved permit or not.
The querier cache refuses to cache queriers that read in reverse. These
queriers are also not closed, with the caller having no way to determine
whether the querier it just moved into `insert()` needs a close
afterwards or not, requiring a `close()` on the moved-from querier just
to be sure.
Avoid this by consistently closing all passed-in queriers, including
those the cache refuses to save. For this, the internal
`insert_querier()` method has to be made a member to be able to use the
closing gate.
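The fixed contract can be sketched like this (assumed names, Python stand-in for the C++ querier cache): `insert()` always consumes the querier, closing it whenever the cache refuses to save it.

```python
class Querier:
    def __init__(self, is_reversed: bool):
        self.is_reversed = is_reversed
        self.closed = False

    def close(self):
        self.closed = True

class QuerierCache:
    def __init__(self):
        self._saved = {}

    def insert(self, key, querier):
        if querier.is_reversed:
            # Refused, but still closed: the caller never has to close
            # a moved-in querier "just to be sure".
            querier.close()
            return
        self._saved[key] = querier
```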
This method is both a convenience method to obtain the permit, as well
as an abstraction to allow different implementations to get creative.
For example, the main implementation, the one in the multishard mutation
query, returns the permit of the saved reader if the lookup was
successful. This ensures that on a multi-paged read the same permit is
used across as many pages as possible. Much more importantly, it ensures
that the evictable reader and the actual reader it wraps both use the
same permit.
Ensure that when the reader has to be created anew the passed-in permit
is used to create it, instead of the one left over in remote-parts,
which is that of the already evicted reader.
This lays the groundwork to ensure the same permit is used across all
pages of a read, by a future patch which creates the wrapping reader
with the existing permit.
As a preparation for up-front admission, add a permit parameter to
`make_multishard_streaming_reader()`, which will be the admitted permit
once we switch to up-front admission. For now it has to be a
non-admitted permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic
"multishard-streaming-reader" one.
As a preparation for up-front admission, add a permit parameter to
`make_streaming_reader()`, which will be the admitted permit once we
switch to up-front admission. For now it has to be a non-admitted
permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic "streaming" one.
A convenience method for obtaining an admitted permit for a read on a
given table.
For now it uses the semaphore's nowait permit-obtaining method, as all normal
reads still use the old admission method. Migrating reads to this method
will make the switch easier, as there will be one central place to
replace the nowait method with the proper one.
To be used for determining the base cost of reads used in admission. For
now it just returns the already used constant. This is a forward-looking
change, to when this will be a real estimation, not just a hardcoded
number.
The execution stage functionality is exposed via two new member
functions, `with_permit()` and `with_ready_permit()`. Both accept a
function to be run. The former obtains a permit then runs the passed in
function through the execution stage. The latter allows an already
obtained permit to be passed in.
Three new methods are added for creating permits:
1) obtain_permit()
2) obtain_permit_nowait()
3) make_tracking_only_permit()
(1) is meant to replace `make_permit()` + `wait_admission()`, by
integrating the waiting for admission into the process of creating the
permit. This is the method meant to be used to create permits from here
on, ensuring that each read passes admission before even being started.
(2) is a bridge between the old and new world. Up-front admission cannot
coexist with the restricted reader in the same read, so those reads that
have a restricted reader in their stack can use this method to create a
non-admitted permit to be admitted by the restricted reader later. Once
we have migrated all reads to (1) or (2), we can get rid of the
restricted reader and just replace (1) with (2) in the codebase. (2)
returns a future to make this a simple rename, the churn of dealing with
a future<reader_permit> return type already having been dealt with by
then.
(3) is for reads that bypass admission, yet their resource usage does
participate in the admission of other reads. This is the equivalent of
reads that don't pass admission at all.
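The three creation paths above can be sketched as follows (a simplified, synchronous Python model of the C++ API; the real calls return futures and the names of the `Permit` fields are assumptions):

```python
class Permit:
    def __init__(self, kind: str, admitted: bool):
        self.kind = kind          # which creation path produced it
        self.admitted = admitted  # has it passed admission?

class Semaphore:
    def obtain_permit(self):
        # (1) admission is integrated into creation: replaces
        # make_permit() + wait_admission()
        return Permit("admitted", admitted=True)

    def obtain_permit_nowait(self):
        # (2) bridge to the old world: created un-admitted, to be
        # admitted later by the restricted reader
        return Permit("nowait", admitted=False)

    def make_tracking_only_permit(self):
        # (3) bypasses admission entirely, but its resource usage is
        # still tracked and participates in other reads' admission
        return Permit("tracking-only", admitted=False)
```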
The following patches will gradually transition the codebase away from
the old permit API, and once the transition is complete, we can switch
over to do the admission up-front at once.
We want to make permits be admitted up-front, before even being created.
As part of this change, we will get rid of the `wait_admission()`
method on the permit, instead, the permit will be created as a result of
waiting for admission (just like back some time ago).
To allow evicted readers to wait for re-admission, a new method
`maybe_wait_readmission()` is created, which waits for readmission if
the permit is in evicted state.
Also refactor the internals of the semaphore to support and favor
up-front admission code. As up-front admission is the future we want the
permit code to be organized in such a way that it is natural to use with
it. This means that the "old-style" admission code might suffer but we
tolerate this as it is on its way out. To this end the following changes
were done:
* Add a _base_resources field to reader_permit which tracks the base
cost of said permit. This is passed in the constructor and is used in
the first and subsequent admissions.
* The base cost is now managed internally by the permit, instead of
relying on an external `resource_units` instance, though the old way
is still supported temporarily.
* Change the admission pipeline to favor the new permit-internally
managed base cost variant.
* Compatibility with old-style admission: permits are created with 0
base resources, base resources are set with the compatibility method
`set_base_resources()` right before admission, then externalized again
after admission with `base_resource_as_resource_units()`. These
methods will be gone when the old style admission is retired (together
with `wait_admission()`).
By enabling shared from this for impl and adding a reader permit
constructor which takes a shared pointer to an impl.
This allows impl members to invoke functions requiring a `reader_permit`
instance as a parameter.
Distinguish between permits that are blocked and those that are not.
Conceptually a blocked permit is one that needs to wait on either I/O or
a remote shard to proceed.
This information will be used by admission, which will only admit new
reads when all currently used ones are blocked. More on that in the
commit introducing this new admission type.
This patch only adds the infrastructure, block sites are not marked yet.
Distinguish between permits that are used and those that are not. These
are two subtypes of the current 'active' state (and replace it).
Conceptually a permit is used when any readers associated with it have a
pending call to any of their async methods, i.e. the consumer is
actively consuming from them.
This information will be used for admission, together with a new blocked
state introduced by a future patch.
This patch only adds the infrastructure, use sites are not marked yet.
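The used/blocked distinction from this and the previous commit message, together with the admission rule they enable, can be sketched as (assumed representation; the real states are C++ enum sub-states of 'active'):

```python
def can_admit(permits):
    """Admit a new read only when every currently *used* permit is also
    *blocked* (waiting on I/O or a remote shard), i.e. no admitted read
    could make progress on this CPU right now."""
    return all(p["blocked"] for p in permits if p["used"])
```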
We want to introduce more fine-grained states for permits than what we
have currently, splitting the current 'active' state into multiple
sub-states. As a preparatory step, introduce an evicted state too, to
keep track of permits that were evicted while being inactive. This will
be important in determining what permits need to re-wait admission, once
we keep permits across pages. Having an evicted state also aids
validating internal state transitions.
This patch adds a comprehensive test suite for the DynamoDB API's TTL
(item expiration) feature.
The tests check the two new API commands added by this feature
(UpdateTimeToLive and DescribeTimeToLive), and also how items are
expired in practice, and how item expiration interacts with other
features such as GSI, LSI and DynamoDB Streams.
Because DynamoDB has extremely long delays until items are expired, or
until expiration configuration may be changed, several of these tests
take up to 30 minutes to complete. We mark these tests with the
"verylong" marker, so they are skipped in ordinary test runs - use the
"--runverylong" option to run them.
All these tests currently pass on DynamoDB, but xfail on Alternator
because the two commands UpdateTimeToLive and DescribeTimeToLive are
currently rejected by Alternator.
Refs #5060
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
gdb is used for testing scylla-gdb.py (since 3c2e852dd), so it needs
to be listed as a dependency. Add it there. It was listed as a
courtesy dependency in the frozen toolchain (which is why it still
worked), so it's removed from there.
Closes #9034
the cluster
This change subscribes service_level_controller for nodes life cycle
notifications and uses the notification of leaving the cluster for
the current node to stop the configuration polling loop. If the loop
continues to run, its queries will fail consistently since the nodes
will not answer them. It is worth mentioning that the queries
failing in the current state of the code is harmless but noisy: after
90 seconds, if the scylla process is not shut down, the failures will
start to generate failure logs every 90 seconds, which is confusing for
users.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
This change removes the use of service::get_local_storage_service,
instead, the approach taken is similar to other modules, for example
storage_proxy where a reference to the `sharded` module is obtained
once and then the local() method in combination with capturing is
used.
Until now, Alternator test have all been very fast, taking milliseconds
or at worst seconds each - or a bit longer on DynamoDB. However,
sometimes we need to write tests which take a huge amount of time - for
example, tests for the TTL feature may take 10 minutes because the item
expiration is delayed by that much. Because a 10 minute test is
ridiculous (all 500 Alternator tests together take just one minute
today!), we would normally run such test once, and then mark it "skip"
so will never run again.
One annoying thing about skipped tests is that there is no way to
temporarily "unskip" them when we want to run such a test anyway.
So in this patch, we introduce a better option for these very slow tests
instead of the simple "skip":
The patch introduces a marker "@pytest.mark.veryslow". By default, a
test with this marker is skipped. However, a command-line option
"--runveryslow" is introduced which causes tests with the veryslow
mark to be run anyway, and not skipped.
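The conventional way to implement such a marker/option pair in pytest looks roughly like this (a sketch of the likely conftest.py content, not necessarily the exact code from the patch):

```python
# conftest.py sketch
import pytest

def pytest_addoption(parser):
    parser.addoption("--runveryslow", action="store_true", default=False,
                     help="run tests marked as veryslow")

def pytest_configure(config):
    config.addinivalue_line("markers",
                            "veryslow: mark test as very slow to run")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runveryslow"):
        return  # run everything, including veryslow tests
    skip = pytest.mark.skip(reason="need --runveryslow option to run")
    for item in items:
        if "veryslow" in item.keywords:
            item.add_marker(skip)
```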
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a convenient function new_test_table() that Alternator tests
can use to safely create a temporary table, and be sure it is deleted in any
case. This function is used in a "with", as follows:
with new_test_table(dynamodb, ...) as table:
do_something(table)
# at this point table has already been deleted.
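A minimal sketch of such a helper, using `contextlib.contextmanager` to guarantee cleanup (the real helper's implementation and table-naming scheme may differ):

```python
from contextlib import contextmanager
import random
import string

@contextmanager
def new_test_table(dynamodb, **kwargs):
    # Random table name to avoid collisions between test runs.
    name = "test_" + "".join(random.choices(string.ascii_lowercase, k=10))
    table = dynamodb.create_table(TableName=name, **kwargs)
    try:
        yield table
    finally:
        # Cleanup is guaranteed even if the test body raises.
        table.delete()
```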
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When the generate-and-propagate-view-updates routine was rewritten
to allow partial results, one important validation got lost:
previously, an error which occurred during update *generation*
was propagated to the user - as an example, the indexed column
value must be smaller than 64kB, otherwise it cannot act as primary
key part in the underlying view. Errors on view update *propagation*
are however ignored in this layer, because it becomes a background
process.
During the rewrite these two got mixed up and so it was possible
to ignore an error that should have been propagated.
This behavior is now fixed.
Fixes #9013
Prepare for making semaphore_timed_out derived from timed_out_error
in seastar. When this happens in seastar, we would need to catch
the derived, more-specific exception first to avoid the following
warning:
```
service/storage_proxy.cc:2818:18: error: exception of type 'seastar::semaphore_timed_out &' will be caught by earlier handler [-Werror,-Wexceptions]
} catch (semaphore_timed_out&) {
^
service/storage_proxy.cc:2815:18: note: for type 'seastar::timed_out_error &'
} catch (timed_out_error&) {
^
```
Later on, after the seastar change is applied to the scylla repo,
we can eliminate the duplication and catch only timed_out_error.
Test: unit(dev) (w/ the seastar changes to semaphore_timed_out
and rpc::timeout_error to inherit from timed_out_error).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210713132858.294504-1-bhalevy@scylladb.com>
The purpose of an LSI (local secondary index) in Alternator is to allow
a different sort key for the existing partitions, keeping the same
division into partitions. So it doesn't make sense to create an LSI on
a table that did not originally have a sort key (i.e., single-item partitions).
DynamoDB indeed doesn't allow this case, and Alternator forgot to forbid
it - so this patch adds the missing check to the CreateTable operation.
This patch also adds a test case for this, test_lsi_wrong_no_sort_key,
which failed before the patch and passes after it (and also passes on
DynamoDB).
Also, the existing test_lsi_wrong tests for bad LSI creation attempts
by mistake used a base table without a sort key - so while they
encountered an error as expected, it was not the right error! So we fix
that test (and split it into two tests), adding the missing sort key
and exposing the actual errors that the tests were meant to expose.
That test passed before this patch and also afterwards - but at least
after the patch it is actually testing what it was meant to be testing.
Fixes#9018.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210713123747.1012954-1-nyh@scylladb.com>
This series unifies the interface for checking if CQL restrictions are satisfied. Previously, an additional mutation-based approach was added in the materialized views layer, but the decision was reached that it's better to have a single API based on partition slices. With that, the regular selection path gets simplified at the cost of more complicated view generation path, which is a good tradeoff.
Note that in order to unify the interface, the view layer performs ugly transformations in order to adjust the input for `is_satisfied_by`. Reviewers, please take a close look at this code (`matches_view_filter`, `clustering_prefix_matches`, `partition_key_matches`), because it looks error-prone and relies on dirty internals of our serialization layer. If somebody has a better suggestion on how to do the transformation, I'm all ears.
Tests: unit(release), manual(playing with materialized views with custom filters)
Fixes #7215
Closes #8979
* github.com:scylladb/scylla:
db,view,table: drop unneeded time point parameter
cql3,expr: unify get_value
cql3,expr: purge mutation-based is_satisfied_by
db,view: migrate key checks from the deprecated is_satisfied_by
db,view: migrate checking view filter to new is_satisfied_by
db,view: add a helper result builder class
db,view: move make_partition_slice helper function up
Instead of calling `set_configuration` directly on a `raft::server`, the
caller will use the higher-level `reconfigure`. Similarly to `call`, the
function converts exceptions into return values (inside a `variant`) and
allows passing in a timeout parameter.
When storing a snapshot `snap`, if `snap.idx > e.idx` where `e`
is the last entry in the log (if any), we need to clear all previous
entries so that we don't create a gap in the log. The log must remain
contiguous.
One case is controversial: what to do if `snap.idx == e.idx + 1`.
Technically no gap would be created between the entry and the snapshot.
However, if we now want to store a new entry with index `e.idx + 2`,
that would create a gap between two entries which is illegal.
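The truncation rule above can be sketched with log entries modelled as a list of indices (a simplified stand-in for the raft log):

```python
def store_snapshot(entries, snap_idx):
    """Keep the log contiguous when storing a snapshot. If the snapshot
    index is past the last entry, clear the log entirely -- including
    the controversial snap_idx == last + 1 case, since appending an
    entry at snap_idx + 1 later would otherwise leave a gap."""
    if not entries or snap_idx > entries[-1]:
        return []  # log restarts after the snapshot
    # Otherwise only drop the entries covered by the snapshot.
    return [i for i in entries if i > snap_idx]
```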
The usage of `template <..., State init_state>` in `persistence`
permitted using only a very restricted class of types (so called
"structural types").
Pass the initial state through `persistence`'s constructor instead. Also
modify the member functions so the State type doesn't need to have a
default constructor.
Inside `call`, if `add_entry` failed or the operation timed out,
the output channel promise would be dropped without setting a value, causing
a `broken_promise` exception. Furthermore the output future would be
dropped, so we get a discarded `broken_promise` future.
The fix:
1. When we drop a channel without a result (inside
`impure_state_machine::with_output_channel`), set an explicit
exception with a dedicated type.
2. Discard the channel future in a controlled way, explicitly handling
the `output_channel_dropped` exception.
The following are now passed to `environment` as parameters:
- network delay,
- failure detector convict threshold.
Environment passes them further down when constructing the underlying
objects.
Generalize the type of the callback: use a template parameter instead of
`noncopyable_function` and don't assume the return type of the callback.
This allows returning a result from `with_env_and_ticker`, e.g. for
performing analysis or logging the results after a part of the test that
used the environment and ticker have finished.
Extend the interface of `network` to allow introducing and removing "grudges"
which prevent the delivery of messages from one given server to another
(when the time comes to deliver a message but there's a grudge, the
message is dropped).
More specifically, return a future which is equivalent to the original
future (when the original future resolves, this future will contain its
result).
Thus we don't discard the future, the user gets it back.
Let them decide what to do with it.
Now that restriction checking is translated to the partition-slice-style
interface, checking the partition/clustering key restrictions for views
can be performed without the time point parameter.
The parameter is dropped from all relevant call sites.
Last two users of the mutation-based is_satisfied_by function
were in the partition/clustering key checks. These functions are now
translated to use the original API.
In order to unify the interfaces, the is_satisfied_by flavor
for mutations is getting deprecated. In order to be able to remove it,
one of its biggest users, the matches_view_filter() function,
is translated to the other variant.
In order to migrate from mutation-based restriction checks,
code in view.cc needs to have a way of translating results
to partition-slice-based representation.
A slightly simplified builder from multishard_mutation_query.cc
is injected into the view code.
As per 32bcbe59ad, the sl_controller is stopped after set_distributed_data_accessor is called.
However if scylla shuts down before that happens, the sl_controller still needs to be stopped.
We need to drain the service level controller before storage_proxy::drain_on_shutdown
is called to prevent queries by the update loop from starting after the
storage_proxy has been drained - leading to issues similar to #9009.
Fixes#9014
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210713055606.130901-1-bhalevy@scylladb.com>
"
Currently, when sstables are suspected to be corrupt, one has a few bad
choices on how to verify that they are indeed correct:
* Obtain suspect sstable files and manually inspect them. This is
problematic because it requires a scylla engineer to have direct access
to data, which is not always simple or even possible due to privacy
protection rules.
* Run sstable scrub in abort mode. This is enough to confirm whether
there is any corruption or not, but only in a binary manner. It is not
possible to explore the full scope of the corruption, as the scrub
will abort on the first corruption.
* Run sstable scrub in non-abort mode. Although this allows for
exploring the full scope of the corruption and it even gets rid of it,
it is a very intrusive and potentially destructive process that some
users might not be willing to even risk.
This patchset offers an alternative: validation compaction. This is a
completely non-intrusive compaction that reads all sstables in turn and
validates their contents, logging any discrepancies it can find. It does
not mutate their content; it doesn't even rewrite them. It is akin to
a dry-run mode for sstable scrub. The reason it was not implemented as
such is that the current compaction infrastructure assumes that input
sstables are replaced by output sstables as part of the compaction
process. Lifting this assumption seemed error-prone and risky, so
instead I snatched the unused "Validation" compaction type for this
purpose. This compaction type completely bypasses the regular compaction
infrastructure, but only at the low level. It still integrates fully
into compaction-manager.
Fixes: #7736
Refs: https://github.com/scylladb/scylla-tools-java/issues/263
Tests: unit(dev)
"
* 'validation-compaction/v5' of https://github.com/denesb/scylla:
test/boost/sstable_datafile_test: add test for validation compaction
test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function
api: storage_service: expose validation compaction
sstables/compaction_manager: add perform_sstable_validation()
sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME
sstables/compaction_manager: add maintenance scheduling group
sstables/compaction_manager: drop _scheduling_group field
sstables/compaction_manager: run_custom_job(): replace parameter name with compaction type
sstables/compaction_manager: run_custom_job(): keep job function alive
sstables/compaction_descriptor: compaction_options: add validation compaction type
sstables/compaction: compaction_options::type(): add static assert for size of index_to_type
sstables/compaction: implement validation compaction type
sstables/compaction: extract compaction info creation into static method
sstables/compaction: extract sstable list formatting to a class
sstables/compaction: scrub_compaction: extract reporting code into static methods
position_in_partition{_view}: add has_key()
mutation_fragment_stream_validator: add schema() accessor
This can catch mismatches between visual indication about
control flow and what the compiler actually does. Looks like
boost cleaned up its indentation since it was disabled in
7f38634080 ("dist/debian: Switch to g++-7/boost-1.63 on
Ubuntu 14.04/16.04"). It's unlikely to pop back since modern
compilers enable it by default.
Closes #9015
"
Currently we `assert(_stopped)` in the destructor, but this is too
harsh, especially on freshly created semaphore instances that weren't
even used yet. This basically mandates semaphores to be initialized at
the end of the constructor body, which is very cumbersome.
Further to that, this series relaxes the checks on destroying an
unstopped previously (but not currently) used semaphore. As destroying
such a semaphore without stop() is risky, an error is still logged.
Tests: unit(dev)
"
* 'reader-concurrency-semaphore-relax-stop-check/v1' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: relax _stopped check when destroying a used semaphore
reader_concurrency_semaphore: allow destroying without stop() when not used yet
reader_concurrency_semaphore: add permit-stats
The default target (i.e. what gets executed under "ninja") excludes
sanitize and coverage modes (since they're useful in special cases
only), but the other multi-mode targets such as "ninja build" do not.
This means that two extra modes are built.
Make things consistent by always using default_modes (which default to
release,dev,debug). This can be overridden using the --mode switch
to configure.py.
Closes #8775
Further relax the conditions under which we abort on destroying an
unstopped semaphore. We already allow destroying completely unused
semaphores, this patch further relaxes this to allow destroying formerly
used but presently not used semaphores without stopping. We still call
`on_internal_error_noexcept()` even if destroying the semaphore is safe,
because without calling `stop()`, destroying the semaphore depends on
luck, which we shouldn't rely on.
To make it easier to construct objects with semaphore members. When the
construction of such object fails, they can now just destroy their
semaphore member as usual, without having to employ tricks to make sure
it is stopped first.
Which stores permit related stats. For now only total number of permits
is maintained which is useful to determine whether the semaphore was
used already or not.
We have been seeing rare failures of the cql-pytest (translated from
Cassandra's unit tests) for testing TTL in secondary indexes:
cassandra_tests/validation/entities/secondary_index_test.py::testIndexOnRegularColumnInsertExpiringColumn
The problem is that the test writes an item with 1 second TTL, and then
sleeps *exactly* 1.0 seconds, and expects to see the item disappear
by that time. Which doesn't always happen:
The problem with that assumption stems from Scylla's TTL clock ("gc_clock")
being based on Seastar's lowres clock. lowres_clock only has a 10ms
"granularity": The time Scylla sees when deciding whether an item expires
may be up to 10ms in the past - the arbitrary point when the lowres timer
happened to last run. In rare overload cases, the inaccuracy may be even
greater than 10ms (if the timer got delayed by other things running).
So when Scylla is asked to expire an item in 1 second - we cannot be
sure it will be expired in exactly 1 second or less - the expiration
can be also around 10ms later.
So in this patch we change the test to sleep with more than enough
margin - 1.1 seconds (i.e., 100ms more than 1 second). By that time
we're sure the item must have expired.
Before this patch, I saw the test failing once every few hundred runs,
after this patch I ran it 2,000 times without a single failure.
Fixes#9008
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210712100655.953969-1-nyh@scylladb.com>
Today our docker image is based on CentOS 7. Since CentOS will be EOL in
2024 and no longer has a stable release stream, let's move our docker
image to be based on Ubuntu 20.04.
Based on the work done in https://github.com/scylladb/scylla/pull/8730,
let's build our docker image based on local packages using buildah
Closes #8849
When a table with compact storage has no regular column (only primary
key columns), an artificial column of type empty is added. Such column
type can't be returned via CQL so CDC Log shouldn't contain a column
that reflects this artificial column.
This patch does two things:
1. Make sure that CDC Log schema does not contain columns that reflect
the artificial column from a base table.
2. When composing a mutation to the CDC Log, omit the artificial column.
Fixes#8410
Test: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #8988
This series fixes some issues that gcc 11 complains about. I believe all
are correct errors from the standard's view. Clang accepts the changed code.
Note that this is not enough to build with gcc 11, but it's a start.
Closes #9007
* github.com:scylladb/scylla:
utils: compact-radix-tree: detemplate array_of<>
utils: compact-radix-tree: don't redefine type as member
raft: avoid changing meaning of a symbol inside a class
cql3: lists: catch polymorphic exceptions by reference
rewrite_sstables() wants to be run in the maintenance group and soon we
will add another compaction type which also wants to be run in the
said group. To enable this propagate the maintenance scheduling group
(both CPU and IO) to the compaction manager.
All callers use it to do operations that are closely associated with one
of the standard compaction types, so no reason to pass in a custom
string instead of the compaction type enum.
Validation just reads all the passed-in sstables and runs the mutation
stream through a mutation fragment stream validator, logging all errors
found, and finally also logging whether all the sstables are valid or
not. Validation is not really a compaction as it doesn't write any
output. As such it bypasses most of the usual compaction machinery, so
the latter doesn't have to be adapted to this outlier.
This patch only adds the implementation, but it still cannot be started
via `compact_sstables()`, that will be implemented by the next patches.
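The validation flow described above can be sketched as (assumed names; sstables modelled as (name, fragment-stream) pairs and the validator as a function returning an error or None):

```python
def validate_sstables(sstables, validate_fragment, log):
    """Stream every fragment of every sstable through a validator,
    log each problem found, and report overall validity. Nothing is
    ever written -- this is a read-only, dry-run style 'compaction'."""
    valid = True
    for name, fragments in sstables:
        for frag in fragments:
            error = validate_fragment(frag)
            if error is not None:
                valid = False
                log(f"invalid fragment in {name}: {error}")
    log("validation passed" if valid else "validation found errors")
    return valid
```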
All the error messages reporting about invalid bits found in the stream.
This allows reusing these messages in the soon-to-be-added validation
compaction. In the process, the error messages are made more
comprehensive and more uniform as well.
The radix tree template defines a nested class template array_of;
both a generic template and a fully specialized version. However,
gcc (I believe correctly) rejects the fully specialized template
that happens to be a member of another class template.
As it happens, we don't really need a template here at all. Define
a non-template class for each of the cases we need, and use
std::conditional_t to select the type we need.
The `direct_layout` and `indirect_layout` template classes accept
a template parameter named `Layout` of type `layout`, and re-export
`Layout` as a static data member named `layout`. This redefinition
of `layout` is disliked by gcc. Fix by renaming the static data member
to `this_layout` and adjust all references.
The construct
struct q {
a a;
};
Changes the meaning of `a` from a type to a data member. gcc dislikes
it and I agree. Fully qualify the type name to avoid an error.
Returning a function parameter guarantees copy elision and does not
require a std::move(). Enable -Wredundant-move to warn us that the
move is unneeded, and gain slightly more readable code. A few violations
are trivially adjusted.
Closes #9004
When creating an index-table query, we form its clustering-key
restrictions by picking the right restrictions from the WHERE clause.
But we skip the indexed column, which isn't in the index-table clustering
key. This is, however, both incorrect and unnecessary:
It is incorrect because we compare the column IDs from different schemas
(indexed table vs. base table). We should instead be comparing column
names.
It is unnecessary because this code is only executed when the whole
partition key plus a clustering prefix is specified in the WHERE clause.
In such cases, the index cannot possibly be on a member of the
clustering prefix, as such a query would be satisfied out of the base
table. Therefore, it is redundant to check for the indexed table among
the CK prefix elements.
A careful reader will note that this check was first introduced to fix
the issue #7888 in commit 0bd201d. But it now seems to me that that fix
was misguided. The root problem was the old code miscalculating the
clustering prefix by including too many columns in it; it should have
stopped before reaching the indexed column. The new code, introduced by
commit 845e36e76, calculates the clustering prefix correctly, never
reaching the indexed column.
(Details, for the curious: the old code invoked
clustering_key_restrictions::prefix_size(), which is buggy -- it
doesn't check the restriction operator. It will, for instance,
calculate the prefix of `c1=0 AND c2 CONTAINS 0 AND c3=0` as 3, because
it restricts c1, c2, and c3. But the correct prefix is clearly 1,
because c2 is not restricted by equality.)
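The correct prefix calculation, sketched on a simplified model (`op` and `prefix_size` here are hypothetical, not the actual restriction classes):

```cpp
#include <cstddef>
#include <vector>

// Simplified model of a clustering-column restriction operator.
enum class op { EQ, CONTAINS, LT, GT };

// The usable clustering prefix extends only through consecutive
// equality restrictions: it must stop at the first non-EQ restriction
// instead of merely counting how many columns are restricted.
std::size_t prefix_size(const std::vector<op>& restrictions) {
    std::size_t n = 0;
    for (op o : restrictions) {
        if (o != op::EQ) {
            break;
        }
        ++n;
    }
    return n;
}
```

With this, `c1=0 AND c2 CONTAINS 0 AND c3=0` yields a prefix of 1, not 3.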
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8993
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.
Fix the handful of cases (all trivial), and enable the warning.
Closes #8992
"
After this series one can use perf_fast_forward to generate the data set.
It takes a lot less time this way than to use scylla-bench.
"
* 'perf-fast-forward-scylla-bench-dataset' of github.com:tgrabiec/scylla:
tests: perf_fast_forward: Use data_source::make_ck()
tests: perf_fast_forward: Move declaration of clustered_ds up
tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default
tests: perf_fast_forward: Add data sets which conform to scylla-bench schema
This patch adds to cql-pytest/README.md a paragraph on where run /
run-cassandra expect to find Scylla or Cassandra, and how to override
that choice.
Also make a couple of trivial formatting changes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708142730.813660-1-nyh@scylladb.com>
"
When obtaining a valid permit was made mandatory, code which now had to
create reader permits but didn't have a semaphore handy suddenly found
itself in a difficult situation. Many places and most prominently tests
solved the problem by creating a thread-local semaphore to source
permits from. This was fine at the time but as usual, globals came back
to haunt us when `reader_concurrency_semaphore::stop()` was
introduced, as these global semaphores had no easy way to be stopped
before being destroyed. This patch-set cleans up this wart, by getting
rid of all global semaphores, replacing them with appropriately scoped
local semaphores, that are stopped after being used. With that, the
FIXME in `~reader_concurrency_semaphore()` can be resolved and we can
finally `assert()` that the semaphore was stopped before being
destroyed.
This series is another preparatory one for the series which moves the
semaphore in front of the cache.
tests: unit(dev)
"
* 'reader-concurrency-semaphore-mandatory-stop/v2' of https://github.com/denesb/scylla: (26 commits)
reader_concurrency_semaphore: assert(_stopped) in the destructor
test/lib: remove now unused reader_permit.{hh,cc}
test/boost: migrate off the global test reader semaphore
test/manual: migrate off the global test reader semaphore
test/unit: migrate off the global test reader semaphore
test/perf: migrate off the global test reader semaphore
test/perf: perf.hh: add reader_concurrency_semaphore_wrapper
test/lib: migrate off the global test reader semaphore
test/lib/simple_schema: migrate off the global test reader semaphore
test/lib/sstable_utils: migrate off the global test reader semaphore
test/lib/test_services: migrate off the global test reader semaphore
test/lib/sstable_test_env: add reader_concurrency_semaphore member
test/lib/cql_test_env: add make_reader_permit()
test/lib: add reader_concurrency_semaphore.hh
test/boost/sstable_test: migrate row counting tests to seastar thread
test/boost/sstable_test: test_using_reusable_sst(): pass env to func
test/lib/reader_lifecycle_policy: add permit parameter to factory function
test/boost/mutation_reader_test: share permit between readers in a read
memtable: migrate off the global reader concurrency semaphore
mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore
...
Now that there are no more global semaphores which are impossible to stop
properly we can resolve the related FIXME and arm the assert in the
semaphore destructor.
We can also remove all the other cleanup code from the destructor as
they are taken care of by stop(), which we now assert to have been run.
After commit 76227fa ("cql-pytest: use NetworkTopologyStrategy, not
SimpleStrategy"), the cql-pytest tests now use NetworkTopologyStrategy
instead of SimpleStrategy in the test keyspaces. The tests continued to
use the "replication_factor" option. Support for this option is
relatively recent, and was only added to Cassandra in the 4.0 release
series (see https://issues.apache.org/jira/browse/CASSANDRA-14303). So
users who happen to have Cassandra 3 installed and want to run
cql-pytest against it will see the tests fail when a keyspace can't be
created.
This patch trivially fixes the problem by using the name of the current
DC (automatically determined) instead of the word 'replication_factor'.
Almost all tests are fixed by a single fix to the test_keyspace fixture
which creates one keyspace used by most tests. Additional changes were
needed in test_keyspace.py, for tests which explicitly create keyspaces.
I tested the result on Cassandra 3.11.10, Cassandra 4 (git master) and
Scylla.
Fixes #8990
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708123428.811184-1-nyh@scylladb.com>
We know that today in Scylla concurrent schema changes done on different
coordinators are not safe - and we plan to address this problem with Raft.
However, the test in this patch - reproducing issue #8968 - demonstrates
that even on a single node concurrent schema changes are not safe:
The test involves one thread which constantly creates a keyspace and
then a table in it - and a second thread which constantly deletes this
keyspace. After doing this for a while, the schema reaches an inconsistent
state: The keyspace is left in limbo where it cannot be dropped
(dropping it succeeds, but doesn't actually drop it), and a new keyspace
cannot be created under the same name.
Note that to reproduce this bug, it was important that the test create
both a keyspace and a table. Were the test to just create an empty keyspace,
without a table in it, the bug would not be reproduced.
Refs #8968.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210704121049.662169-1-nyh@scylladb.com>
Supplying a convenience semaphore wrapper, which stops the contained
semaphore when destroyed. It also provides a more convenient
`make_permit()`. This class is intended to make the migration to local
semaphores less painful.
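The wrapper's behavior can be sketched with a toy semaphore (all names here are illustrative; the real class wraps reader_concurrency_semaphore and returns real permits):

```cpp
// Toy stand-in for reader_concurrency_semaphore: records that stop()
// was called through an externally owned flag.
struct toy_semaphore {
    bool* stopped_flag;
    void stop() { if (stopped_flag) { *stopped_flag = true; } }
    int make_permit_impl() { return 1; } // placeholder "permit"
};

// RAII convenience wrapper: stops the contained semaphore when
// destroyed, and offers a shorter make_permit() helper.
class semaphore_wrapper {
    toy_semaphore _sem;
public:
    explicit semaphore_wrapper(bool* flag = nullptr) : _sem{flag} {}
    ~semaphore_wrapper() { _sem.stop(); } // stop before destruction
    int make_permit() { return _sem.make_permit_impl(); }
};
```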
semaphore_timed_out errors should be ignored, similar to
rpc::timeout_error or seastar::timed_out_error, so that they are
eventually converted to `read_timeout_exception` via
the data/digest read resolver's on_timeout() method.
Otherwise, the semaphore timeout is mistranslated to
read_failure_exception via on_error().
Note that originally the intention was to change the exception
thrown by the reader_concurrency_semaphore expiry_handler, but
there are already several places in the code that catch and handle
the semaphore_timed_out exception that would need to be changed,
increasing the risk in this change.
Fixes #8958
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210708083252.1934651-2-bhalevy@scylladb.com>
In older versions of Cassandra (such as 3.11.10 which I tried), the
CQL server is not turned on by default, unless the configuration file
explicitly has "start_native_transport: true" - without it only the
Thrift server is started.
So fix the cql-pytest/run-cassandra to pass this option. It also
works correctly in Cassandra 4.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708113423.804980-1-nyh@scylladb.com>
In order to avoid large allocations and too large mutations
generated from large view updates, granularity of the process
is broken down from per-partition to smaller chunks.
The view update builder now produces partial updates, no more
than 100 view rows at a time.
The series was tested manually with a particular scenario in mind -
deleting a large base partition, which results in creating
a view update per each deleted row - which, with sufficiently
large partitions, can reach millions. Before the series, Scylla
experienced an out-of-memory condition after the view update
generation mechanism tried to load too much data into a contiguous
buffer. Multiple large allocation warnings and reactor stalls were
observed as well. After the series, the operation is still rather slow,
but does not induce reactor stalls nor allocator problems.
A reduced version of the above test is added as a unit test -
it does not check for huge partitions, but instead uses a number
just large enough to cause the update generation process to be split
into multiple chunks.
Fixes #8852
Closes #8906
* github.com:scylladb/scylla:
cql-pytest: add a test case for base range deletion
cql-pytest: add a test case for base partition deletion
table: elaborate on why exceptions are ignored for view updates
view: generate view updates in smaller parts
table: coroutinize generating view updates
db,view: move view_update_builder to the header
The test case checks that deleting a base table clustering range
works fine. This operation is potentially heavy, as it involves
generating a view update for every row. With large enough ranges,
the number can reach millions and beyond.
The test case checks that deleting a whole base table partition
works fine. This operation is potentially heavy, as it involves
generating a view update for every row. With large enough partitions,
the number can reach millions and beyond.
The factory method doesn't match the signature of
`reader_lifecycle_policy::make_reader()`, notably the permit is missing.
Add it, as it is important that the wrapping evictable reader and the
underlying reader share the same permit.
Permits were designed such that there is one permit per read, being
shared by all readers in that read. Make sure readers created by tests
adhere to this.
Naming the concurrency semaphore is currently optional, unnamed
semaphores defaulting to "Unnamed semaphore". Although the most
important semaphores are named, many still aren't, which makes for a
poor debugging experience when one of these times out.
To prevent this, remove the name parameter defaults from those
constructors that have it and require a unique name to be passed in.
Also update all sites creating a semaphore and make sure they use a
unique name.
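The change amounts to dropping the defaulted name parameter; a minimal sketch (`named_semaphore` is illustrative, not the real class):

```cpp
#include <string>
#include <utility>

// Requiring a name at construction (no default argument) forces every
// call site to pass one, so a timed-out semaphore never reports itself
// as "Unnamed semaphore" in logs.
class named_semaphore {
    std::string _name;
    long _count;
public:
    named_semaphore(long count, std::string name) // no default for `name`
        : _name(std::move(name)), _count(count) {}
    const std::string& name() const { return _name; }
    long count() const { return _count; }
};
```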
The generate_and_propagate_view_updates() function explicitly
ignores exceptions reported from the underlying view update
propagation layer. This decision is now explained in the comment.
In order to avoid large allocations and too large mutations
generated from large view updates, granularity of the process
is broken down from per-partition to smaller chunks.
The view update builder now produces partial updates, no more
than 100 view rows at a time.
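The chunking approach can be reduced to a sketch (`build_updates_in_chunks` and the row model are hypothetical; only the 100-row bound comes from the commit):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Process base rows in bounded batches so no single view-update
// mutation grows with the partition size. 100 mirrors the commit's
// "no more than 100 view rows at a time".
constexpr std::size_t max_rows_per_update = 100;

template <typename Row, typename Flush>
std::size_t build_updates_in_chunks(const std::vector<Row>& rows, Flush flush) {
    std::size_t batches = 0;
    for (std::size_t i = 0; i < rows.size(); i += max_rows_per_update) {
        std::size_t end = std::min(rows.size(), i + max_rows_per_update);
        // Build and emit one partial view update covering rows [i, end).
        flush(i, end);
        ++batches;
    }
    return batches;
}
```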
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.
Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
This dataset exists for convenience, to be able to run scylla-bench
against the data set generated by perf_fast_forward. It doesn't
increase coverage. So do not include it by default to not waste
resources on it.
Cassandra 3.0 deprecated the 'sstable_compression' attribute and added
'class' as a replacement. Follow suit by supporting both.
The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED
to detect all uses and prevent future misuse.
To prevent old-version nodes from seeing the new name, the
compression_parameters class preserves the key name when it is
constructed from an options map, and emits the same key name when
asked to generate an options map.
Existing unit tests are modified to use the new name, and a test
is added to ensure the old name is still supported.
Fixes #8948.
Closes #8949
"
The main goal of this series is to improve efficiency of reads from large partitions by
reducing amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.
Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.
The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache
entry to store the vtable pointer.
SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).
In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to promoted index would have to be
preceded with the parsing of the partition index page containing the partition key.
Performance testing results follow.
1) scylla-bench large partition reads
Populated with:
perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
-c1 -m1G --populate --value-size=1024 --rows=10000000
Single partition, 9G data file, 4MB index file
Test execution:
build/release/scylla -c1 -m4G
scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
-clustering-row-count 10000000 -duration 60m
TL;DR: after: 2x throughput, 0.5x median latency
Before (c1daf2bb24):
Results
Time (avg): 5m21.033180213s
Total ops: 966951
Total rows: 966951
Operations/s: 3011.997048812112
Rows/s: 3011.997048812112
Latency:
max: 74.055679ms
99.9th: 63.569919ms
99th: 41.320447ms
95th: 38.076415ms
90th: 37.158911ms
median: 34.537471ms
mean: 33.195994ms
After:
Results
Time (avg): 5m14.706669345s
Total ops: 2042831
Total rows: 2042831
Operations/s: 6491.22243800942
Rows/s: 6491.22243800942
Latency:
max: 60.096511ms
99.9th: 35.520511ms
99th: 27.000831ms
95th: 23.986175ms
90th: 21.659647ms
median: 15.040511ms
mean: 15.402076ms
2) scylla-bench small partitions
I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
code performed slightly better.
Below is a representative run over data set which does not fit in memory.
scylla -c1 -m4G
scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \
-clustering-row-count 1 -duration 60m -no-lower-bound
Before:
Time (avg): 51.072411913s
Total ops: 3165885
Total rows: 3165885
Operations/s: 61988.164024260645
Rows/s: 61988.164024260645
Latency:
max: 34.045951ms
99.9th: 25.985023ms
99th: 23.298047ms
95th: 19.070975ms
90th: 17.530879ms
median: 3.899391ms
mean: 6.450616ms
After:
Time (avg): 50.232410679s
Total ops: 3778863
Total rows: 3778863
Operations/s: 75227.58014424688
Rows/s: 75227.58014424688
Latency:
max: 37.027839ms
99.9th: 24.805375ms
99th: 18.219007ms
95th: 14.090239ms
90th: 12.124159ms
median: 4.030463ms
mean: 5.315111ms
The results include the warmup phase, which populates the partition index cache, so the hot-cache effect
is dampened in the statistics; see the 99th percentile. Latency gets better after the cache warms up.
3) perf_fast_forward --run-tests=large-partition-skips
Caching is not used here, included to show there are no regressions for the cold cache case.
TL;DR: No significant change
perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G
Config: rows: 10000000, value size: 2000
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5%
1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1%
1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5%
1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6%
1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3%
1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1%
1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3%
1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6%
1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8%
64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2%
64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5%
64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8%
64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8%
64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7%
64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7%
64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8%
64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4%
1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0%
1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0%
1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2%
1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8%
1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0%
1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9%
1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5%
1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3%
64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9%
64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1%
64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3%
64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2%
64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1%
64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2%
64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8%
64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2%
4) perf_fast_forward --run-tests=large-partition-slicing
Caching enabled, each line shows the median run from many iterations
TL;DR: We can observe reduction in IO which translates to reduction in execution time,
especially for slicing in the middle of partition.
perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases
Config: rows: 10000000, value size: 2000
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0%
0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5%
0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6%
0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4%
5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2%
5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9%
5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8%
5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0%
After:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2%
0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1%
0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4%
0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7%
5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8%
5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6%
5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 19.6%
5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4%
Future work:
- Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache
which may reduce the hit ratio.
- Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.
- Disable cache population for "bypass cache" reads
- Add a switch to disable sstable index caching, per-node, maybe per-table
- Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
the partition index page can be hot.
- Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's
bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the
partition entry and then let binary search read the rest.
In V2:
- Fixed perf_fast_forward regression in the number of IOs used to read partition index page
The reader uses 32K reads, which were split by the page cache into 4K reads.
Fixed by propagating IO size hints to the page cache and using a single IO to populate it.
New patch: "cached_file: Issue single I/O for the whole read range on miss"
- Avoid large allocations to store partition index page entries (due to managed_vector storage).
There is a unit test which detects this and fails.
Fixed by implementing chunked_managed_vector, based on chunked_vector.
- fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks
- Simplify region_impl::free_buf() according to Avi's suggestions
- Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind
- Work around a SIGSEGV which was most likely due to coroutine miscompilation, by manipulating local object scope.
- Wire up system/drop_sstable_caches RESTful API
- Fix use-after-move on permit for the old scanning ka/la index reader
- Fixed more cases of double open_data() in tests leading to assert failure
- Adjusted cached_file class doc to account for changes in behavior.
- Rebased
Fixes #7079.
Refs #363.
"
* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
api: Drop sstable index caches on system/drop_sstable_caches
cached_file: Issue single I/O for the whole read range on miss
row_cache: cache_tracker: Do not register metrics when constructed for tests
sstables, cached_file: Evict cache gently when sstable is destroyed
sstables: Hide partition_index_cache implementation away from sstables.hh
sstables: Drop shared_index_lists alias
sstables: Destroy partition index cache gently
sstables: Cache partition index pages in LSA and link to LRU
utils: Introduce lsa::weak_ptr<>
sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
cached_file: Introduce get_page_units()
sstables: read: Document that primitive_consumer::read_32() is alloc-free
sstables: read: Count partition index page evictions
sstables: Drop the _use_binary_search flag from index entries
sstables: index_reader: Keep index objects under LSA
lsa: chunked_managed_vector: Adapt more to managed_vector
utils: lsa: chunked_managed_vector: Make LSA-aware
test: chunked_managed_vector_test: Make exception_safe_class standard layout
lsa: Copy chunked_vector to chunked_managed_vector
...
After 845e36e76 "cql3: Use expr for global-index partition slice",
there is actually more dead code than was initially dropped.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8981
Fixes #8952
In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all"
need to operate on a defensive copy of the list.
Closes #8980
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite reckless, as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, appending_hash<row>() is not, and
can throw.
The original intent of the mentioned commit was to facilitate the
partition_hasher in the repair/ code. The hasher itself has been removed
by commit 0af7a22c21, so it no longer needs the feed_hash()s to be
noexcept.
The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.
Fixes: #8983
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
`query_processor::execute_direct()` takes a non-const ref
to query options, meaning it's not safe to pass the same
instance to subsequent invocations of `execute_direct()`
in the tests.
Copy default query options at each invocation of `execute_cql()`
so no possible side-effects can occur.
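The fix pattern can be sketched like this (toy types; the real signature is query_processor::execute_direct() taking cql3::query_options by non-const reference):

```cpp
#include <string>

// Stand-in for cql3::query_options; execute_direct() takes it by
// non-const reference and may mutate it.
struct query_options {
    int page_state = 0;
};

void execute_direct(const std::string& /*query*/, query_options& opts) {
    opts.page_state += 1; // side effect on the caller's object
}

// Test helper: copy the shared defaults on every call, so one
// invocation's side effects can't leak into the next.
void execute_cql(const std::string& query, const query_options& defaults) {
    query_options copy = defaults; // fresh copy per invocation
    execute_direct(query, copy);
}
```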
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>
In pull request #8568, the CDC API changed slightly, with preimage data
gaining extra "delete$k" values for columns whose preimage was missing.
In this new test, we verify that this change did not break Alternator.
We didn't expect it to break Alternator, because it just outputs the known
base-table columns and ignores the columns which weren't a real base-table
column - like this "delete$k".
In the test we set up a stream with preimages, ensure that a real column
(note that an LSI key is a real column instead of a map element) has a
null preimage - and see that the preimage is returned as expected,
without fake columns like "delete$k".
The test passes, showing that PR #8568 was ok.
The test also passes, as expected, on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210504120121.915829-1-nyh@scylladb.com>
All tests in cql-pytest use a test keyspace created with the SimpleStrategy
replication strategy. This was never intentional. We are recommending to
users that they should use NetworkTopologyStrategy instead, and even
want to deprecate SimpleStrategy (this is #8586), so tests should stop
using SimpleStrategy and should start using the same strategy users would
use in their applications - NetworkTopologyStrategy.
Almost all tests are fixed by a single change in conftest.py which
changes how "test_keyspace" is created. But additionally, tests in
test_keyspace.py which explicitly create keyspaces (that's the point of
that test file...) also had to be modified to use NetworkTopologyStrategy.
Note that none of the tests relied on any special features or
implementation details of SimpleStrategy.
This patch is part of the bigger effort to remove reliance on
SimpleStrategy from all tests, of all types - which we will need to do if
we ever want to forbid SimpleStrategy by default. The issue of that effort:
Refs #8638
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620102341.195533-1-nyh@scylladb.com>
logalloc has a nice leak/double-free sanitizer, with the nice
feature of capturing backtraces to make error reports easy to
track down. But capturing backtraces is itself very expensive.
This patch makes backtrace capture optional, reducing database_test
runtime from 30 minutes to 20 minutes on my machine.
Closes #8978
"
When a corrupted sstable fails to be read, either on a regular read or
in regular compaction, our logging is not useful: it can't pinpoint
the sstable that was being read from, and it may not print useful
details about the corruption.
For example, when a compaction fails on data corruption, a cryptic
message such as the following will be dumped:
compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum): retrying
There are two problems with the log above:
1) it doesn't tell us which sstable is corrupted
2) it doesn't tell us detailed info about the checksum failure on compressed chunk
With those problems fixed, we'll now get a much more useful message:
compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition
from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491
failed checksum, expected=0, actual=1422312584): retrying
tests: mode(dev).
"
* 'log_data_corruption_v2.1' of github.com:raphaelsc/scylla:
sstables: Attach sstable name to exception triggered in sstable mutation reader
test/broken_sstable_test: Make test more robust
sstables: Make log more useful when compressed chunk fails checksum
sstables: Use correct exception when compressed chunk fails checksum
Another step towards dropping the `restrictions` class. When calculating the partition slice of a global-index table, use `expression` objects instead of a `restrictions` subclass.
Refs #7217.
Tests: unit (all dev, some debug)
Closes#8950
* github.com:scylladb/scylla:
cql3: Use expr for global-index partition slice
cql3: Fully explain statement_restrictions members
cql3: Pass schema reference not pointer
cql3: Replace count_if with find_atom
cql3: Fix _partition_range_is_simple calculation
cql3: Add test cases for indexed partition column
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory
after uninstalling Scylla, or may want to upgrade the enterprise version.
Also, we mixed two types of files as ghost files, which should be handled differently:
1. automatically generated by postinst scriptlet
2. generated by user invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
However, just dropping .mount from the %files section causes another
problem: rpm will remove these files during upgrade, instead of on
uninstall (#8924).
To fix both problems, specify .mount files as "%ghost %config".
This keeps the files on both package upgrade and package removal.
See scylladb/scylla-enterprise#1780
Closes#8810
Closes#8924
Closes#8959
An incorrect spelling of max-io-requests was used in creating the
warning message text. Due to conversion to unsigned, the code would
crash with a bad_lexical_cast exception. The spelling of this
configuration name was fixed in the past (44f3ad836b), but only in the
'if' condition.
Fix by using the correct spelling.
Closes#8963
Currently, reading a page range would issue I/O for each missing
page. This is inefficient, better to issue a single I/O for the whole
range and populate cache from that.
As an optimization, issue a single I/O if the first page is missing.
This is important for index reads which optimistically try to read
32KB of index file to read the partition index page.
Some tests will create two cache_tracker instances because of one
being embedded in the sstable test env.
This would lead to double registration of metrics, which raises a
runtime error. Avoid this by not registering metrics in prometheus in
tests at all.
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferencing it will
always give a valid reference.
This can be used as a building block for implementing cursors into
LSA-based caches.
Example simple use:
// LSA-managed
struct X : public lsa::weakly_referencable<X> {
    int value;
};
lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
    X* x = current_allocator().construct<X>();
    return x->weak_from_this();
});
std::cout << x_ptr->value;
Will be needed later for reading a page view which cannot use
make_tracked_temporary_buffer(). Standardize on get_page_units(),
converting existing code to wrap the units in a deleter.
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:
class parsed_promoted_index_entry;
class parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now the consumer's responsibility
to allocate the index_entry out of parsed_partition_index_entry.
Will be needed by the index reader to ensure that the destructor doesn't
invoke the allocator, so that everything is destroyed in the desired
allocation context before the object is destroyed.
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.
This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
The file object which is currently stored there has per-request
tracing wrappers (permit, trace_state) attached to it. It doesn't make
sense once the entry is cached and shared. Annotate when the cursor is
created instead.
Index reads and promoted index reads are both using the same
cached_file now, so there's no need to pass the buffers between the
index parser and promoted index reader.
Makes the promoted_index structure easier to move to LSA.
After this patch, there is a single index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces the amount of I/O on subsequent reads.
As part of this, cached_file needed to be adjusted in the following ways.
The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cache_file is changed to keep buffers in LSA
using lsa_buffer allocation method.
When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.
In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializing a
temporary_buffer.
While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:
1) LSA segments only hold buffers, they don't hold metadata. They
also don't mix with standard allocations. So a 128K segment can
hold 32 4K buffers.
2) objects' lifetime is managed by lsa_buffer, an owning smart
pointer, which is automatically updated when buffers are migrated
to another segment. This makes LSA allocations easier to use and
off-loads metadata management to the client (which can keep the
lsa_buffer wherever it wants).
The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entangled object inside lsa_buffer.
The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
Instead of creating a single_column_clustering_key_restrictions
object, create an equivalent vector of expr::expressions and calculate
from it the clustering ranges just like we do for base-table queries.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
... to get_single_column_clustering_bounds().
No need for the pointer; a reference is simpler and cleaner.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Was updated for every restriction instead of only for partition ones.
The only impact is on performance. The bug was introduced in 4661aa0
"cql3: Track IN partition-key restrictions".
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
We didn't have a case when a global index exists on a partition column
and the SELECT statement specifies the full partition key.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
make_sstable_containing() already calls open_data(), so does
load(). This would trigger the assertion failure added in a later patch:
assert(!_cached_index_file);
There is no need to call load() here.
It's an adaptor between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold arbitrary element type.
The LRU can link objects of different types, which is achieved by
having a virtual base class called "evictable" from which the linked
objects should inherit. When the object is removed from the LRU,
evictable::on_evicted() is called.
The container is non-owning.
Rows might be large so free them gently by:
- adding bytes_ostream::clear_gently, which may yield in the chunk
freeing loop.
- using that in frozen_mutation_fragment, contained in repair_row.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent a reactor stall as seen in #8926
Note: this patch doesn't use coroutines, to
facilitate backporting.
Coroutinization will be done in a follow-up patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Not only close it. Next patch will use clear_gently on stop
to prevent reactor stalls.
Double-stop prevention code was added to stop()
since, in the error-free case, the repair_meta master is
already stopped by `repair_row_level_stop` when the auto_stop
master deferred action is called.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
clear_gently of the foreign_ptr needs to run on the owning
shard, so provide a specialization from the SmartPointer
implementation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define a bunch of clear_gently methods that asynchronously
clear the contents of containers and allow yielding.
This replaces clear_gently(std::list<T>&) used by row level
repair by a more generic template implementation.
Note that we do not use coroutines in this patch
to facilitate backporting to releases that do not support coroutines
and since a miscompilation bug was hit with clang++ 11 when attempting
to coroutinize this patch (see
https://bugs.llvm.org/show_bug.cgi?id=50345).
Test: stall_free_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
recreation
As they depend on each other, it is easier to add a test if they are in
a series.
Fixes: #8923
Fixes: #8893
Tests: unit(dev, mutation_reader_test:debug)
"
* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
test: mutation_reader_test: add more test for reader recreation
evictable_reader: relax partition key check on reader recreation
evictable_reader: always reset static row drop flag
This reverts commit a677c46672. It causes
upgrade from a version that did not have a commit to a version that
does have the commit to lose the .mount files, since they change
from being owned by the package (via %ghost) to not being owned.
Fixes#8924.
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.
Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.
Fixes#8957
Test: unit(dev)
DTest: listen_ports_conf_test (modified)
Closes#8956
* github.com:scylladb/scylla:
messaging_service: do_start_listen: improve info log accuracy
messaging_service: never listen on port 0
The replication factor passed to NetworkTopologyStrategy (which we call
by the confusing name "auto expand") may or may not be used (see
explanation why in #8881), but regardless, we should validate that it's
a legal number and not some non-numeric junk, and we should report the error.
Before this patch, the two commands
CREATE KEYSPACE name WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
ALTER KEYSPACE name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 'foo' }
succeed despite the invalid replication factor "foo". After this patch,
the second command fails.
The problem fixed here is reproduced by the existing test
test_keyspace.py::test_alter_keyspace_invalid when switching it to use
NetworkTopologyStrategy, as suggested by issue #8638.
Fixes#8880
Refs #8881
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620100442.194610-1-nyh@scylladb.com>
Make sure to log the info message when we actually
start listening.
Also, print a log message when listening on the
broadcast address.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.
Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When recreating the underlying reader, the evictable reader validates
that the first partition key it emits is what it expects to be. If the
read stopped at the end of a partition, it expects the first partition
to be a larger one. If the read stopped in the middle of a certain
partition it expects the first partition to be the same it stopped in
the middle of. This latter assumption doesn't hold in all circumstances
however. Namely, the partition it stopped in the middle of might get
compacted away in the time the read was paused, in which case the read
will resume from a greater partition. This perfectly valid case however
currently triggers the evictable reader's self validation, leading to
the read being aborted and a scary error being logged. Relax this
check to accept any partition that is >= compared to the one the read
stopped in the middle of.
When the evictable reader recreates its underlying reader, it does it
such that it continues from the exact mutation fragment the read was
left off from. There are however two special mutation fragments, the
partition start and static row that are unconditionally re-emitted at
the start of a new read. To work around this, when stopping at either of
these fragments the evictable reader sets two flags
_drop_partition_start and _drop_static_row to drop the unneeded
fragments (that were already emitted by it) from the underlying reader.
These flags are then reset when the respective fragment is dropped.
_drop_static_row has a vulnerability though: the partition doesn't
necessarily have a static row, and if it doesn't, the flag is currently
not cleared and can stay set until the next fill-buffer call, causing
the static row to be dropped from another partition.
To fix, always reset the _drop_static_row flag, even if no static row
was dropped (because it doesn't exist).
Scylla doesn't allow unencrypted connections over encrypted CQL ports
(Cassandra does allow this, by setting "optional: true", but it's not
secure and not recommended). Here we add a test that, indeed, we can't
connect to an SSL port using an unencrypted connection.
The test passes on Scylla, and also on Cassandra (run it on Cassandra
with "test/cql-pytest/run-cassandra --ssl" - for which we added support
in a recent patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629121514.541042-1-nyh@scylladb.com>
Fixtures in conftest.py (e.g., the test_keyspace fixture) can be shared by
all tests in all source files, so they are marked with the "session"
scope: All the tests in the testing session may share the same instance.
This is fine.
Some test files have additional fixtures for creating special tables
needed only in those files. Those were also, unnecessarily, marked with
"session" scope. This means that these temporary tables are
only deleted at the very end of the test suite, even though they can be
deleted at the end of the test file which needed them - other test
source files don't have access to it anyway. This is exactly what the
"module" fixture scope is for, so this patch changes all the fixtures that
are private to one test file to use the "module" scope.
After this patch, the teardown of the last test in the suite goes down
from 0.26 seconds to just 0.06 seconds.
Another benefit is that the peak disk usage of the test suite is
lower, because some of the temporary tables are deleted sooner.
This patch does not change any test functionality, and also does not
make any test faster - it just changes the order of the fixture
teardowns.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#8932
Previously, the disk block alignment of segments was hardcoded (due to
really old code). Now we use the value as declared in the actual file
opened. If we are using a previously written file (i.e. o_dsync), we
can even use the sometimes smaller "read" alignment.
Also allow config to completely override this with a disk alignment
config option (not exposed to global config yet, but can be).
v2:
* Use overwrite alignment if doing only overwrite
* Ensure to adjust actual alignment if/when doing file wrapping
v3:
* Kill alignment config param. Useless and unsafe.
Closes#8935
This patch adds support for the "--ssl" option in run-cassandra, which
will now be able, like run (which runs Scylla), to run Cassandra
listening on an *SSL-encrypted* CQL connection. The "--ssl" option is also
passed to the tests, so they know to encrypt their CQL connections.
We already had support for this feature in the test/cql-pytest/run
script - which runs Scylla. Adding this also to the run-cassandra
script can help verify that a behavior we notice in Scylla's SSL support
and we want to add to a test - is also shared by Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210629082532.535229-1-nyh@scylladb.com>
When compaction fails due to a failure that comes from a specific
sstable, like on data corruption, the log isn't telling which
sstable contributed to that. Let's always attach the sstable name to
the exception triggered in sstable mutation reader.
Exceptions in the la and mx consumers attached the sstable name, but now
only the sstable mutation reader does it, to avoid duplicating the name.
Now:
ERROR 2021-06-11 16:07:34,489 [shard 0] compaction_manager - compaction failed:
sstables::malformed_sstable_exception (Failed to read partition from SSTable
/home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file
offset 406491 failed checksum, expected=0, actual=1422312584): retrying
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Test breaks very easily whenever there's a change in the message
formatted for malformed_sstable_exception. Make test more robust
by not checking exact message, but that the message contains
both the expected exception and the sstable filename.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The current log is useless when a checksum fails: you don't know
which compressed chunk failed the checksum, the expected and the
actual checksums, the size of the chunk, and so on.
Before:
compressed chunk failed checksum
Now:
compressed chunk of size 3735 at file offset 406491 failed checksum, expected=0, actual=1422312584
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Add architecture name for relocatable packages, to support distributing
both x86_64 version and aarch64 version.
Also create symlink from new filename to old filename to keep
compatibility with older scripts.
Fixes#8675
Closes#8709
[update tools/python3 submodule:
* tools/python3 ad04e8e...afe2e7f (1):
> reloc: add arch to relocatable package filename
]
* seastar 0e48ba883...eaa00e761 (3):
> memory: reduce statistics TLS initialization even more
> Merge "Sanitize io-topology creation on start" from Pavel E
> doc/prometheus: note that metric family is passed by query name
The permit creation path enters the semaphore's permit gate in
on_permit_created(). Entering this gate can throw so this method is not
noexcept. Remove the noexcept specifier accordingly.
Also enter the gate before adding the permit to the permit list, to save
some work when this fails.
Fixes: #8933
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210628074941.32878-1-bdenes@scylladb.com>
Now that all supported versions write mc/md sstables, we can deprecate the MC_SSTABLE feature bit and consider it implicitly true, and with it the ability to write la/ka sstables.
We still need to support reading them, e.g. from restoring old snapshots or migrating data from legacy clusters.
Test: unit(dev, debug)
Fixes#8352
Closes#8884
* github.com:scylladb/scylla:
compress: Remove unused make_compressed_file_k_l_format_output_stream
sstables: move sstable_writer to separate header
sstable_writer: remove get_metadata_collector
sstables: stop including metadata_collector.hh in sstables.hh
sstables: Remove duplicated friend declaration
sstables: remove unused KL writer
sstables: Always use MC/MD writer
sstable_datafile_test: switch tests to use latest sstables format
sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version
test_offstrategy_sstable_compaction: test all writable sstables
compaction_with_fully_expired_table: Remove some LA specific code
sstable_mutation_test: test latest sstable format instead of LA
sstable_test: Test MX sstables instead of KA/LA
sstable_datafile_test: Fix schema used by check_compacted_sstables
sstables: Remove LA/KA sstable writting tests that check exact format
sstables: define writable_sstable_versions
features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled
"
query_singular() accepts a partition_range_vector, corresponding to an IN
query. But such queries are rare compared to single-partition queries.
Co-routinise the code and special case non-IN queries by avoiding
the call to map_reduce. Also replace the executors array with small_vector
to avoid an allocation in the common case.
perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10:
before: median 204545.04 tps ( 81.1 allocs/op, 15.1 tasks/op, 48828 insns/op)
after: median 219769.97 tps ( 74.1 allocs/op, 12.1 tasks/op, 46495 insns/op)
So, a ~7% improvement in tps and 5% improvement in instructions per op.
Also large reduction in tasks and allocations.
This is an alternative proposal to https://github.com/scylladb/scylla/pull/8909.
The benefit of this one is that it does not duplicate any code (almost).
"
* 'query_singular-coroutine' of github.com:scylladb/scylla-dev:
storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried
storage_proxy: use small_vector in storage_proxy::query_singular to store executors
storage_proxy: co-routinize storage_proxy::query_singular()
This class is used in only a few places and does not have to be included
everywhere the sstable class is needed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This function is only called internally so it does not have to be
exposed and can be inlined instead.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previous two patches removed the usage of KL writer so the code is now
dead and can be safely removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previous patch made MC the lowest sstables format in use so
the removed check is always true now and we can remove the else
part.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
instead of LA. Ability to write LA and KA sstables will be removed
by the following patches so we need to switch all the tests to write
newer sstables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
Stopping transport (cql, thrift, messaging, etc.) can happen from
several places -- drain, decommission, stop, isolation. Some of
them can even run in parallel. This patch makes transport stopping
bulletproof.
tests: unit(dev)
start-stop (dev)
start-drain-stop (dev)
fixes: #8911
"
* 'br-stop-transport-races' of https://github.com/xemul/scylla:
storage_service: Indentation fix
storage_service: Make stop_transport re-entrable
storage_service: Stop transport on decommission
Storage service installs disk error handlers in its constructor and these
connections are not unregistered. It's not a problem in real life,
because storage service is not stopped, but in some tests this can
lead to use-after-frees.
The sstables_datafile_test runs some of the testcases in cql_test_env
which starts and (!) stops the storage service. Other testcases are
run in a lightweight sstables_test_env which does not mess with the
storage service at all. Now, if a case of the 2nd kind is run after
one of the 1st kind and (for whatever reason) generates a disk error,
it will trigger a use-after-free -- after the 1st testcase the storage
service disk error registration would remain, but the storage service
itself would already be stopped, so triggering the disk error would
try to access the stopped sharded storage service inside .isolate().
The fix is to keep the scoped connection on the storage service list
of various listeners. On stop it will go away automagically.
tests: unit(dev), sstables_datafile_test with forced disk error
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210625062648.27812-1-xemul@scylladb.com>
std::copy_if runs without yielding.
See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480
Also, eliminate extraneous loop on merge
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.
Fixes#8897
Test: unit(dev), stall_free_test(debug)
DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev)
Closes#8925
* github.com:scylladb/scylla:
utils: merge_to_gently: eliminate extraneous loop on merge
utils: merge_to_gently: prevent stall in std::copy_if
It was observed that as compaction progresses the backlog of compacting SSTable
is being reduced very slowly, which causes shares to be higher than needed, and
consequently compaction acts much more aggressively than it has to.
https://user-images.githubusercontent.com/1409139/120237819-93dfc080-c232-11eb-9042-68114e285ea0.png
The graph above shows the amount of backlog that is reduced from a SSTable
being compacted. The red line denotes the total backlog of the SSTable, before
it's selected for compaction. The expectation is that the more a SSTable is
compacted the more backlog will be reduced from it. However, in the current
implementation, it can be seen that the backlog to be reduced, from the SSTable
being compacted, starts being inversely proportional to the amount of data
already compacted.
Turns out that this problem happens because the implementation of backlog
formula becomes incorrect when the SSTable is being compacted.
Backlog for a sstable is currently defined as:
Bi = Ei * log (T / Ei)
where Ei = Si - Ci (bytes left to be compacted)
and Si = size of SStable
and Ci = total bytes compacted
and T = total size of table
The formula above can also be rewritten as follows:
Bi = Ei * log (T) - Ei * log (Ei)
the second term `Ei * log (Ei)` can be rewritten as:
= (Si - Ci) * log (Ei)
= Si * log (Ei) - Ci * log (Ei)
However, digging into the backlog implementation, it turns out that we're
incorrectly implementing that second term as:
= Si * log (Si) - Ci * log (Ei)
Given that Si > Ei, for a SSTable being compacted, the backlog will be higher
than it should.
the following table shows how the backlog of a SSTable being compacted behaves
now versus how it's supposed to behave:
https://gist.github.com/raphaelsc/42e14be0d7d4ed264e538c2d217c8f95
Turns out that this is not the only problem. It was a mistake to change the
formula from `Ei * log(T / Si)` to `Ei * log(T / Ei)`, when fixing the
shrinking table issue, because that also causes the backlog of a compacting
SSTable to be incorrectly reduced.
With the formula rewritten as follows:
Bi = Ei * log (T) - Ei * log (Ei)
It becomes clear that the more a SSTable is compacted, the slower it becomes
for backlog to be reduced, as T / Ei can increase considerably over time.
So we're reverting the formula back to `Ei * log(T / Si)`.
The graph below shows a better backlog behavior when table is shrinking:
https://user-images.githubusercontent.com/1409139/123495186-06a54700-d5f9-11eb-9386-3fcf4dd8e4d3.png
While analyzing the problem when the table is shrinking, we realized that it's
because T in the formula is implemented as the effective size (total + partial -
compacted).
With the new formula rewritten as follows:
Bi = Ei * log (T) - Ei * log (Si)
It becomes clearer that T cannot be lower than Si whatsoever, otherwise the
backlog becomes negative. Also, while table is shrinking, it can happen that
the backlog will be so low that compaction will barely make any progress.
To fix both issues, let's implement T as total size (sum of all Si) rather than
effective size (sum of all Ei).
The graph below shows that this change prevents the backlog from going negative
while still providing similar and expected behavior as before, see:
https://user-images.githubusercontent.com/1409139/123495185-060cb080-d5f9-11eb-89f7-ed445729702a.png
Fixes#8768.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210626003133.3011007-1-raphaelsc@scylladb.com>
It may happen that a disk error occurs and the subsequent isolation runs
in parallel with drain, decommission or shutdown. In this case the
stop_transport method would be running twice in parallel. Also,
drain after decommission is not disabled, so it may happen that
stop_transport will be called twice in a row (however -- not in
parallel).
Using shared_promise solves all the possible reentrances.
(the indentation is deliberately left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The stop_transport sequence is:
- stop client services (cql, thrift, alternator)
- stop gossiping
- stop messaging
- stop stream manager
The decommissioning goes very similarly
- stop client services
- stop batchlog manager
- stop gossiping
- stop messaging
So this change makes decommission stop all networking _before_
batchlog, like it's already done on drain, and additionally stop
the streaming manager.
This change is prerequisite for fixing race between transport
stop and isolation (on disk error).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Following patches will switch all sstable writing tests to use
the latest sstables format. compaction_with_fully_expired_table
contains some test for a LA specific behaviour so let's remove it
to make the switch possible.
For more context see https://github.com/scylladb/scylla/issues/2620
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Replace calls to make_compressed_file_k_l_format_input_stream
with calls to make_compressed_file_m_format_input_stream.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
check_compacted_sstables is used in compact_02 test which uses sstables
created by compact_sstables. The problem is that schema used in
check_compacted_sstables and compact_sstables is not the same.
The type of r1 column is different. This was not a problem when the
test was running on LA sstables but following patches will switch
all the tests to use MC and then sstable schema becomes validated
when reading the sstable and the test will fail such validation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Those tests check that created sstables have exactly the expected bytes
inside. This won't work with other sstable formats, and writing LA/KA
sstables will be removed by the following patches, so there's nothing
we can do with those tests but remove them. Otherwise they will be
failing after the LA/KA writing capability is removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
and use it instead of all_sstable_versions in tests that check
writing of sstables. Following patches remove the LA/KA writer, so we
want tests to be ready for that and not break by trying to write LA/KA
sstables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
These features have been around for over 2 years and every reasonable
deployment should have them enabled.
The only case where those features might not be enabled is when the user
has used enable_sstables_mc_format config flag to disable MC sstable
format. This case has been eliminated by removing
enable_sstables_mc_format config flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The netw command tries to access the netw::_the_messaging_service that
was removed long ago. The correct place for the messaging service is
in debug:: namespace.
The scylla-gdb test checks that, but the netw command sees that the ptr
in question is not initialized, thinks it's not yet sharded::start()-ed
and exits without errors.
tests: unit(gdb)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624135107.12375-1-xemul@scylladb.com>
DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time
(users should use TimeWindowCompactionStrategy, TWCS, instead). This
patch adds a new configuration option - restrict_dtcs - which can be used
to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE
statements. This is part of a "safe mode" effort to allow an installation
to restrict operations which are un-recommended or dangerous.
The new restrict_dtcs option has three values: "true", "false", and "warn":
For the time being, "false" is still the default, and means DTCS is not
restricted and can still be used freely. We can easily change this
default in a followup patch.
Setting a value of "true" means that DTCS *is* restricted -
trying to create a table or alter a table with it will fail with an error.
Setting a value of "warn" will allow the create or alter operation, but
will warn the user - both with a warning message which will immediately
appear in cqlsh (for example), and with a log message.
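A minimal sketch of such a tri-state guard (check_dtcs_allowed and its arguments are illustrative names, not Scylla's actual code path):

```python
# Hypothetical sketch of the tri-state restrict_dtcs guard.
DTCS = "DateTieredCompactionStrategy"

def check_dtcs_allowed(restrict_dtcs, warnings, strategy=DTCS):
    if strategy != DTCS:
        return True                      # other strategies are unaffected
    if restrict_dtcs == "true":
        raise ValueError("DTCS is restricted; use TimeWindowCompactionStrategy")
    if restrict_dtcs == "warn":
        # surfaced both to the client (e.g. cqlsh) and to the log
        warnings.append("DTCS is deprecated; consider TWCS")
    return True                          # "false": allowed freely
```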
Fixes #8914.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210624122411.435361-1-nyh@scylladb.com>
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.
Note that the standard states that no iterators or references are invalidated
on insert so we can safely keep looking at `first1` after inserting a copy of
`*first2` before it.
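The same idea in a Python analogue of the merge (merge_into is an illustrative name; the real code operates on C++ iterators):

```python
def merge_into(list1, list2):
    """Stable in-place merge of sorted list2 into sorted list1.
    Python analogue of the pattern above: after inserting a copy of
    list2's item before position i, step past it immediately -- the next
    list2 item can never sort before the one just inserted."""
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list2[j] < list1[i]:
            list1.insert(i, list2[j])
            j += 1
        # either way, the element now at position i is in its final place
        i += 1
    list1.extend(list2[j:])
    return list1
```

This saves the wasted iteration that would otherwise re-compare the just-inserted element before advancing past it.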
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Check if the timeout has expired before issuing I/O.
Note that the sstable reader input_stream is not closed
when the timeout is detected. The reader must be closed anyhow after
the error bubbles up the chain of readers and before the
reader is destroyed. This might already happen if the reader
times out while waiting for reader_concurrency_semaphore admission.
Test: unit(dev), auth_test.test_alter_with_timeouts(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210624073232.551735-1-bhalevy@scylladb.com>
The bootstrap procedure starts by "waiting for range setup", which means
waiting for a time interval specified by the `ring_delay` parameter (30s
by default) so the node can receive the tokens of other nodes before
introducing its own tokens.
However it may sometimes happen that the node doesn't receive the
tokens. There are no explicit checks for this. But the code may crash in
weird ways if the tokens-received assumption is false, and we are lucky
if it does crash (instead of, for example, allowing the node to
incorrectly bootstrap, causing data loss in the process).
Introduce an explicit check-and-throw-if-false: a bootstrapping node now
checks that there's at least one NORMAL token in the token ring, which
means that it had to have contacted at least one existing node
in the cluster, which means that it received the gossip application
states of all nodes from that node; in particular the tokens of all
nodes.
Also add an assert in CDC code which relies on that assumption
(and would cause weird division-by-zero errors if the assumption
was false; better to crash on assert than this).
Ref #8889.
Closes #8896
1) Start node n1, n2, n3
2) Bootstrap n4 and kill n4 in the middle of bootstrap
3) Wipe data on n4 and start n4 again
After step 2, n1, n2 and n3 will remove n4 from gossip after
fat_client_timeout and put n4 in quarantine for quarantine_delay().
If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2
and n3 will ignore gossip updates from n4, and n4 will not learn gossip
updates from the cluster.
After PR #8896, the bootstrap will be rejected.
This patch promotes the gossip quarantine-over log message to info level, so
that dtest can wait for it before bootstrapping the node again.
Refs #8889
Refs #8890
Closes #8905
Make some improvements to docs/alternator.md as suggested by a user who
had trouble understanding the previous version, and also a few other
random cleanups.
Closes#8910
* github.com:scylladb/scylla:
docs/alternator/alternator.md: improve "Running Alternator" section
docs/alternator/alternator.md: correct minor typos
docs/alternator/alternator.md: fix link format
Fixes #8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space but cannot hold the mutation being added), we can have disk usage that is below the threshold, yet still get a disk footprint that is over the limit, causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background tasks etc., otherwise we might fail to wake up replenish and get stuck in the gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
Also fix an edge case (for tests), when we have too few segments to have an active one (i.e. we need to flush everything).
New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
Closes #8764
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog_test: Improve test assertion
commitlog: Add waitable future for background sync/flush
commitlog: abort queues on shutdown
commitlog: break out "abort" calls into member functions
commitlog: Do explicit discard+delete in shutdown
commitlog: Recycle or not should not depend on shutdown state
commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
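The threshold change in the last two patches can be sketched as follows (should_flush and its parameters are illustrative names, not the actual commitlog API):

```python
def should_flush(active_bytes, wasted_bytes, segment_size, max_segments):
    """Sketch of the adjusted check: wasted (slack) space counts as used,
    and flushing starts once allocation is inside the last allowable
    segment, so a fresh segment is hopefully ready before the hard limit
    is hit."""
    footprint = active_bytes + wasted_bytes   # not just "active" data
    threshold = (max_segments - 1) * segment_size
    return footprint >= threshold
```

With 4 segments of 32 units each, flushing starts at a footprint of 96 units, whether that footprint is real data or slack.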
The helper is used to walk the tree key-by-key, destroying it
in the meantime. The current implementation of this method just
uses the "regular" erasing code, which, despite the name, actually
rebalances the tree.
The biggest problem with removing the rebalancing is that at
some point non-balanced tree may have the left-most key on an
inner node, so to make 100% rebalance-less unlink every other
method of the tree would have to be prepared for that. However,
there's an option to make "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- the left-most key is on the leaf.
Some more tech details. The current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough
The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. And then it goes
ahead and replaces it with the next sibling.
This change removes ~60% of the keys movements on random test.
Keys still move when the sibling replace happens because in
this case the separation key needs to be placed at the right
sibling 0 position which means shifting all its keys right.
tests: unit(debug)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
_sstables_opened_but_not_loaded was needed because the old loader would
open sstables from all shards before loading them.
In the new loader, introduced with reshape, make_sstables_available()
is called on each shard after resharding and reshape finished, so
there's no need whatsoever for that mess.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>
Move `raft_sys_table_storage_test` and `raft_address_map_test` to
`test/raft` folder since they naturally belong here, not in
`test/boost` folder.
Tests: unit(dev)
* manmanson/move_some_raft_tests_to_raft_folder:
test: raft: move `raft_address_map_test` to `raft` folder
test: raft: move `raft_sys_table_storage_test` to `raft` folder
configure: add extended raft testing dependencies
The process of getting a queue pointer is quite tricky here.
First, it checks if the vptr resolves into 'vtable for async_work_item'
and puts a None mark into known_vptrs dict.
Then, if the entry is present, there are two options. If it's NOT
None, it's translated directly into the queue object. But if it IS None,
a loop over offsets starts that checks whether vptr + offset maps
to a queue. And here's the problem -- if no offset mapped to any
specific queue, the last checked offset was put into the known_vptrs
dict, so subsequent vptrs would skip the offset checking and go
ahead and use the "random" offset that had failed previously.
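The fix boils down to caching an offset only when the probe succeeds. A hypothetical sketch (queue_offset and probe are illustrative, not the actual scylla-gdb code):

```python
def queue_offset(vptr, known_vptrs, probe):
    """Sketch of the fixed lookup: probe(vptr, offset) returns True if
    vptr + offset maps to a queue. An offset is cached in known_vptrs
    only once it is known to be good, so a failed probe is never reused
    for later vptrs."""
    if vptr in known_vptrs:
        return known_vptrs[vptr]
    for offset in range(0, 64, 8):
        if probe(vptr, offset):
            known_vptrs[vptr] = offset   # cache only a successful offset
            return offset
    return None                          # nothing matched; cache nothing
```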
tests: unit(gdb)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624085723.7156-1-xemul@scylladb.com>
A user complained that the "Running Alternator" section was confusing.
It didn't say outright which two configurations are necessary and you
had to read a few paragraphs to reach it, and it mixed the YAML names
of options and the command-line names, which are subtly different.
This patch tries to improve this.
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.
* scylla-dev/raft-stepdown-v4:
raft: test: test leadership transfer timeout
raft: allow to initiate leader stepdown process
Rename `scylla_raft_dependencies` to `scylla_minimal_raft_dependencies`
and introduce `scylla_raft_dependencies` that contains
`scylla_core` (i.e., all scylla source files).
The new `scylla_raft_dependencies` variable will be used
for `raft_address_map_test` and `raft_sys_table_storage_test`,
which use a lot of machinery from scylla.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.
At the moment compaction just closes the reader unilaterally
and this yanks the reader from under the queue_reader_handle feet,
causing dangling queue reader and broken_promise errors
as seen in #8755.
Instead, fix queue_reader::close to set value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if handle is terminated.
Fixes #8755
Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"
* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
queue_reader_handle: add get_exception method
queue_reader: close: set value on promises on detach from handle
Unfortunately the scylla.docs.scylladb.com formatter which generates
https://scylla.docs.scylladb.com/master/alternator/alternator.html
doesn't know how to recognize HTTP URLs and convert them into proper
HTML links (something which github's formatter does).
So convert the two URLs we had in alternator.md into markdown links
which both github and our formatter recognize.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fixes #8749
If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this before
erasing. Otherwise we get random memory corruption via
std::vector::erase.
v2:
* Make the interface more set-like (tolerate non-existence in erase).
Closes #8904
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
in runtime. Additionally, the default concurrency per shard is made
lower and is now 8.
The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
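The bounded-concurrency send loop can be sketched with asyncio (send_hints is an illustrative stand-in for the hint manager's send path):

```python
import asyncio

async def send_hints(hints, concurrency):
    """Sketch of a per-shard send loop bounded by a configurable limit,
    like max_hinted_handoff_concurrency_per_shard."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = peak = 0

    async def send_one(hint):
        nonlocal in_flight, peak
        async with sem:                  # at most `concurrency` sends at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)       # stand-in for the actual hint RPC
            in_flight -= 1

    await asyncio.gather(*(send_one(h) for h in hints))
    return peak

peak = asyncio.run(send_hints(range(100), concurrency=8))
assert peak <= 8                         # the limit is never exceeded
```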
Refs: #8624
Closes #8646
Test that if leadership transfer cannot be done in configured time frame
fsm cancels the leadership transfer process. Also check that timeout_now
message is resent on each tick while leadership transfer is in progress.
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.
We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If a node is still a leader after the timeout the
operation will fail.
This is a more informative name. Helps see that, say, group0
is a separate service and not bundle all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
Fixes #8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
v3:
* Fix table::clear to discard rp_sets before memtables
Closes #8894
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory
after uninstalling Scylla, or may want to upgrade to the enterprise version.
Also, we mixed two types of files as ghost files that should be handled differently:
1. automatically generated by the postinst scriptlet
2. generated by user-invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
See scylladb/scylla-enterprise#1780
Closes #8810
Commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk, even if actually
clean.
This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for this
to complete in shutdown, before pruning and deleting segments
Currently we capture the snapshot mutation_source by reference
for calling create_underlying_reader after closing the reader.
However, if close_reader yields, the snapshot reference passed
may be gone, so capture it by value instead.
Fixes#8848
Test: unit(dev)
DTest: restore_snapshot_using_old_token_ownership_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>
* tools/java 599b2368d6...5013321823 (4):
> cassandra-stress: fix failure due to the assert exception on disconnect when test is completed
> node_probe: toppartitions: Fix wrong class in getMethod
> Fix NullPointerException in SettingsMode
> cassandra-stress: Remove maxPendingPerConnection default
* tools/jmx a7c4c39...5311e9b (2):
> storage_service: takeSnapshot: support the skipFlush option
> build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient
"
The storage service is carried along storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers on lifecycle events (and stop
the subscription on shutdown) emitted from storage service.
This dependency chain can be greatly simplified, since the
storage proxy is already subscribed on lifecycle events and
can kick managers directly from its hooks.
tests: unit(dev),
dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"
* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
hints: Drop storage service from managers
hints: Do not subscribe managers on lifecycle events directly
The existing test_ssl.py which tests for Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite its name, SSL version 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.
So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.
With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.
Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.
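For reference, the new-style version pinning looks like this (a generic client-context sketch, not the test's actual code):

```python
import ssl

# Pin the handshake to an explicit protocol range with the API added in
# Python 3.7 / OpenSSL 1.1.0g, instead of the deprecated PROTOCOL_* aliases.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.maximum_version = ssl.TLSVersion.TLSv1_3

# Note: ssl.PROTOCOL_SSLv23, despite its name, negotiates the highest
# mutually supported TLS version; it does not request SSLv2/SSLv3.
```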
Refs #8837.
Refs #8827.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
* seastar 813eee3e...0e48ba88 (5):
> net/tls: on TLS handshake failure, send error to client
> net/dns: fix build on gcc 11
> core: fix docstring for max_concurrent_for_each
> test: alien_test: replace deprecated call to alien::submit_to() with new variant
> alien: prepare for multi-instance use
The fix "net/tls: on TLS handshake failure, send error to client"
fixes#8827.
The test
test/cql-pytest/run --ssl test_ssl.py
now xpasses, so I'll remove the "xfail" mark in a followup patch.
The storage service pointer is only used so (un)subscribe
to (from) lifecycle events. Now the subscription is gone,
so can the storage service pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Managers sit on storage proxy which is already subscribed on
lifecycle events, so it can "notify" hints managers directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The patch set introduces a few leadership transfer tests, some of them
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).
Others test different scenarios ensuring that pending leadership
transfer doesn't disrupt the rest of the cluster from progressing:
Lost `timeout_now` messages` (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as
lost `vote_request(force)` from the new candidate
(test_leader_transfer_lost_force_vote_request) don't impact the
election process following that and the leader is elected as normal.
* manmanson/leadership_transfer_tests_v3:
raft: etcd_test: test_transfer_non_member
raft: etcd_test: test_leader_transfer_ignore_proposal
raft: fsm_test: test_leader_transfer_lost_force_vote_request
raft: fsm_test: test_leader_transfer_lost_timeout_now
raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just the matter of accepting it in the grammar.
Fixes #8855
Closes #8876
This series addresses #8852 by:
* migrating to chunked_vector in view update generation code to avoid large allocations
* reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent
Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.
Tests: unit(release)
Refs #8852
Closes #8856
* github.com:scylladb/scylla:
db,view: limit the number of simultaneous view update futures
db,view: use chunked_vector for view updates
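The bounded-futures idea can be sketched as follows (apply_view_updates is an illustrative name; asyncio futures stand in for seastar ones):

```python
import asyncio

async def apply_view_updates(updates, max_in_flight):
    """Sketch of capping the number of simultaneously pending view-update
    futures instead of storing one future per update in a vector."""
    pending = set()
    completed = 0
    for update in updates:
        if len(pending) >= max_in_flight:
            # wait for at least one in-flight update before sending more
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            completed += len(done)
        pending.add(asyncio.ensure_future(asyncio.sleep(0)))  # fake update
    if pending:
        await asyncio.wait(pending)
        completed += len(pending)
    return completed
```

Memory now grows with max_in_flight rather than with the total number of view updates, which can reach millions.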
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.
This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:
- there is a quadratic algorithm in
abstract_write_response_handler::response(): it searches for a replica
and erases it. Since this happens for every replica, it happens N^2/2
times.
- replica sets for writes always include all datacenters, while reads
usually involve just one datacenter.
So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.
We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely will. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.
Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
--task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.
before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op)
after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op)
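The 105-compare figure can be reproduced with a toy model of the erase loop (worst-case arrival order assumed):

```python
def total_compares(n):
    """Toy model of the quadratic response handling: each of n responses
    scans the shrinking replica vector before erasing its entry.
    Worst-case arrival order (always the last entry) is assumed."""
    compares = 0
    replicas = list(range(n))
    for r in reversed(range(n)):
        idx = replicas.index(r)          # idx elements scanned before the hit
        compares += idx
        replicas.pop(idx)
    return compares

print(total_compares(15))  # 105, matching 15*(15-1)/2 from the text
```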
Closes #8847
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0:s, and the_merge_lock will be accessed
from multiple threads. This won't work.
To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.
Closes #8858
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.
This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.
In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).
Refs #8725
Closes #8726
* github.com:scylladb/scylla:
test: database_test: add snapshot_skip_flush_works
api: storage_service/snapshots: support skip-flush option
snapshot: support skip_flush option
table: snapshot: add skip_flush option
api: storage_service/snapshots: add sf (skip_flush) option
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.
This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is:
Refs #8869.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.
So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.
So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:
1. Two nodes with hit rates above 0.95 will be considered to have the
same hit rate, so they will get equal amount of work - even if one
has hit rate 0.98 and the other 0.99.
2. A cold node with hit rate 0.0 will get 5% of the work of a node with
the perfect hit rate limited to 0.95. This will allow the cold node to
slowly warm up its cache. Before this patch, if the hot node happened
to have a hit rate of 0.999 (the previous maximum), the cold node would
get just 0.1% of the work and remain almost idle and fill its cache
extremely slowly - which is a waste.
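A sketch of the capped weighting (load_weights is an illustrative helper, not the actual HWLB code):

```python
def load_weights(hit_rates, max_hit_rate=0.95):
    """Sketch of miss-rate balancing with a capped hit rate: weight each
    node by 1 / miss_rate after clamping hit rates to max_hit_rate."""
    weights = [1.0 / (1.0 - min(h, max_hit_rate)) for h in hit_rates]
    total = sum(weights)
    return [w / total for w in weights]

# Two near-perfect nodes are now treated as equal:
print(load_weights([0.999, 0.998]))      # [0.5, 0.5]
# A cold node still gets ~5% of a capped hot node's share, so it can warm up:
print(load_weights([0.0, 0.95]))
```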
Fixes #8815.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited amount of these futures
at a time.
Currently, column names can only appear in a boolean binary expression,
but not on their own. This means that in the statement
SELECT a FROM tab WHERE a > 3;
We can represent the WHERE clause as an expression, but not the selector.
To pave the way for using expressions in selector contexts, we promote
the elements of binary_operator::lhs (column_value, column_value_tuple,
token) to be expressions in their own right. binary_operator::lhs
becomes an expression (wrapped in unique_ptr, because variants can't
contain themselves).
Note that all three new possibilities make sense in a selector:
SELECT column FROM tab
SELECT token(pk) FROM tab
SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab
There is some fallout from this:
- because binary_operator contains a unique_ptr, it is no longer
copyable. We add a copy constructor and assignment operator to
compensate.
- often, the new elements don't make sense when evaluating a boolean
expression, which is the only context we had before. We call
on_internal_error in these cases. The parser right now prevents such
cases from being constructed in the first place (this is equivalent to
if (some_struct_value) in C).
- in statement_restrictions.cc, we need to evaluate the lhs in the context
of the full binary operator. I introduced with_current_binary_operator()
for this; an alternative approach is to create a new sub-visitor.
Closes #8797
Miscellaneous preparatory patches for group 0 discovery.
* scylla-dev/raft-group-0-part-2-v4:
raft: (service) servers map is gid -> server, not sid -> server
system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
raft: (server) wait for configuration transition to complete
raft: (server) implement raft::server::get_configuration()
raft: (service) don't throw from schema state machine
raft: (service) permit some scylla.raft cells to be empty
raft: (service) properly handle failure to add a server
raft: implement is_transient_error()
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.
After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.
To fix, make a copy before we yield.
Fixes #8859
Closes #8862
Raft Group registry should map Raft Group Id to Raft Server,
not Raft Server ID (which is identical for all groups) to Raft server.
Raft Group 0 ID works as a cluster identifier, so is generated when a
new cluster is created and is shared by all nodes of the same cluster.
Implement a helper to get raft::server by group id.
Consistently throw a new raft_group_not_found exception
if there is no server or rpc for the specified group id.
In case we only have a single segment active when shutting down,
the replenisher can be blocked even though we manually flush-deleted.
Add a signal type and abort queues using this to wake up waiters and
force them to check shutdown status.
If we are using recycling, we should always use recycle in
delete_segments, otherwise we can cause deadlock with replenish
task, since it will be waiting for a segment, then shutdown is set,
and we are called, and can't fulfil the alloc -> deadlock.
If a segment, when finishing a flush call, is deletable, we should issue
a manual call to the discard function (which moves deletable segments off
the segment list) asap, since we otherwise are dependent on more calls
from flush handlers (memtable flush). And since we could have blocked
segment allocation, this can cause deadlocks, at least in tests.
Refs #8270
Since segment allocation looks at actual disk footprint, not active,
the threshold check in timer task should include slack space so we
don't mistake sparse usage for space left.
Refs #8270
Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of "halfway through". This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle on heavy usage.
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.
Nowadays it's one of the VLE variants, implemented with the help of a
loop. Re-implementing this piece with fewer instructions and without
a loop greatly reduces the allocation latency.
The speed-up mostly comes from loop-less code that doesn't confuse
branch predictor. Also the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than byte-by-byte.
Perf measurements:
1. (new) logallog test shows ~40% smaller times
2. perf_mutation in release mode shows ~2% increase in tps
3. the encoder itself is 2 - 4 times faster on x86_64 and
1.05 - 3 times faster on aarch64. The speed-up depends on
the 'encoded length', old encoder has linear time, the
new one is constant
tests: unit(dev), perf(release), just encoder on Aarch64
"
* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
lsa: Use express encoder
uleb64: Add express encoding
lsa: Extract uleb64 code into header
test: LSA allocation perf test
To make it possible to use the express encoder, lsa needs to
make sure that the value is below the express supreme value and
provide the size of the gap after the encoded value.
Both requirements can be satisfied when encoding the migrator
index on object allocation.
On free the encoded value can be larger; the extended
express encoder would need more instructions and would not be
as efficient, so the old encoder is used there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object LSA uses the uleb64 encoder with 2
features that allow optimizing the encoder:
1. the value is migrator.index() which is small enough
to fit 2 bytes when encoded
2. After the descriptor there usually comes an object
which is of 8+ bytes in size
Feature #1 makes it possible to encode the value with just
a few instructions. At -O3, clang compiles it to something like:
mov %esi,%ecx
and $0xfc0,%ecx
and $0x3f,%esi
lea (%rsi,%rcx,4),%ecx
add $0x40,%ecx
Next, the encoder needs to put the value into a gap whose
size depends on the alignment of previous and current objects,
so the classical algo loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros by using 2 64-bit movs. In this case the
encoded value runs past the needed size and overwrites some
memory after it. That's OK, as this overwritten memory is where
the allocated object _will_ be, so its current contents are not
of any interest.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The LSA code encodes an object descriptor before the object
itself. The descriptor is a 32-bit value, and to store it in an
efficient manner it's encoded as an unsigned little-endian
base-64 sequence.
The encoding code is going to be optimized, so put it into a
dedicated header in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
The close path of the multishard combining reader is riddled with
workarounds for the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed once the reader no longer needs them, so we had
to come up with all sorts of creative and ugly workarounds to keep
these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy, the
semaphores created by it can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.
Tests: unit(dev, release:v2, debug:v2,
mutation_reader_test:debug -t test_multishard,
multishard_mutation_query_test:debug,
multishard_combining_reader_as_mutation_source:debug)
"
* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
mutation_reader: reader_lifecycle_policy: remove convenience methods
mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
test/lib/reader_lifcecycle_policy: fix indentation
mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
reader_lifecycle_policy implementations: fix indentation
mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
mutation_reader: shard_reader::close(): wait on the remote reader
multishard_mutation_query: destroy remote parts in the foreground
mutation_reader: shard_reader::close(): close _reader
mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
When the queue_reader_handle is terminated it was
either explicitly aborted or the reader was closed prematurely.
In this case _consume_fut should hold the root-cause error
(e.g. when compaction is stopped). Return it instead
of trying to push the mutation fragment.
If no error is returned from _consume_fut, make sure to
return either the queue_reader_handle error, if available,
or a generic error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent broken_promise exception.
Since close() is mandatory, the queue_reader destructor,
that just detaches the reader from the handle, is not
needed anymore, so remove it.
Adjust the test_queue_reader unit test accordingly.
Test: test_queue_reader(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
By default, wait for the server to leave the joint configuration
when making a configuration change.
When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().
Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with an exception saying that a
configuration change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.
So let's be nice and wait for a full transition in
server::set_configuration().
When loading raft state from scylla.raft, permit some cells
to be empty. Indeed, the server is not obliged to persist
all of vote, term, and snapshot once it starts. And the log can be
empty.
The test measures the time it takes to allocate a bunch
of small objects on LSA inside single segment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently `require()` throws an exception when the condition fails. The
problem with this is that the error is only printed at the end of the
test, with no trace in the logs on where exactly it happened, compared
to other logged events. This patch also adds an error-level log line to
address this.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>
These convenience methods are not used as much anymore and they are not
even really necessary as the register/unregister inactive read API got
streamlined a lot to the point where all of these "convenience methods"
are just one-liners, which we can just inline into their few callers
without losing readability.
No need for a shared pointer anymore, as we don't have to potentially
keep the shard reader alive after the multishard reader is destroyed, we
now do proper cleanup in close().
We still need a pointer as the shard reader is un-movable but is stored
in a vector which requires movable values.
Now that we don't rely on any external machinery to keep the relevant
parts of the context alive until needed as its life-cycle is effectively
enclosed in that of the life-cycle policy itself, we can cleanup the
context in `destroy_reader()` itself, avoiding a background trip back to
this shard.
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep different resources alive until background reader cleanup finishes.
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear point in time at which it is safe to destroy.
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't know how this even works (or if it does),
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single
read to be admitted at any time.
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
This series partially addresses #8852 and its problems caused by deleting large partitions from tables with materialized views. The issue in question is not fixed by this series, because a full fix requires a more complex rewrite of the view update mechanism.
This series makes calculating affected clustering ranges for materialized view updates more resilient to large allocations and stalls. It does so by futurizing the function which can potentially involve large computations and makes it use non-contiguous storage instead of std::vector to avoid large allocations.
Tests: unit(release)
Closes #8853
* github.com:scylladb/scylla:
db,view,table: futurize calculating affected ranges
table: coroutinize do_push_view_replica_updates
db,view: use chunked vector for view affected ranges
interval: generalize deoverlap()
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.
A consequence of this move is that errors which might happen during
the stopping of the reader are now handled in the shard reader, not
in every lifecycle policy implementation.
We now have a future<> returning close() method so we don't need to
do the cleanup of the remote reader in the background, detaching it
from the shard-reader under destruction. We can now wait for the
cleanup properly before the shard reader is destroyed and just pass the
stopped reader to reader_lifecycle_policy::destroy_reader(). This patch
does the first part -- moving the cleanup to the foreground, the API
change of said method will come in the next patch.
Currently the foreign fields of the reader meta are destroyed in the
background via the foreign pointer's destructor (with one exception).
This makes the already complicated life-cycle of these parts and their
dependencies even harder to reason about, especially in tests, where
even things like semaphores live only within the test.
This patch makes sure to destroy all these remote fields in the
foreground in either `save_reader()` or `stop()`, ensuring that once
`stop()` returns, everything is cleaned up.
The reason we got away without closing _reader so far is that it is an
`std::unique_ptr<evictable_reader>` which is a
`flat_mutation_reader::impl` instance, without the
`flat_mutation_reader` wrapper, which contains the validations for
close.
"
This series introduces a new version of the mutation fragment stream (called v2)
which aims at improving range tombstone handling in the system.
When compacting a mutation fragment stream (e.g. for sstable compaction, data query, repair),
the compactor needs to accumulate range tombstones which are relevant for the yet-to-be-processed range.
See range_tombstone_accumulator. One problem is that it has unbounded memory footprint because the
accumulator needs to keep track of all the tombstoned ranges which are still active.
Another, although more benign, problem is computational complexity needed to maintain that data structure.
The fix is to get rid of the overlap of range tombstones in the mutation fragment stream. In v2 of the
stream, there is no longer a range_tombstone fragment. Deletions of ranges of rows within a given
partition are represented with range_tombstone_change fragments. At any point in the stream there
is a single active clustered tombstone. It is initially equal to the neutral tombstone when the
stream of each partition starts. The range_tombstone_change fragment type signifies changes of the
active clustered tombstone. All fragments emitted while a given clustered tombstone is active are
affected by that tombstone. Like with the old range_tombstone fragments, the clustered tombstone
is independent from the partition tombstone carried in partition_start.
The memory needed to compact a stream is now constant, because the compactor needs to only track the
current tombstone. Also, there is no need to expire ranges on each fragment because the stream emits
a fragment when the range ends.
This series doesn't convert all readers to v2. It introduces adaptors which can convert
between v1 and v2 streams. Each mutation source can be constructed with either v1 or v2 stream factory,
but it can be asked any version, performing conversion under the hood if necessary.
In order to guarantee that v1 to v2 conversion produces a well-formed stream, this series needs to
impose a constraint on v1 streams to trim range tombstones to clustering restrictions. Otherwise,
v1->v2 converter could produce range tombstone changes which lie outside query restrictions, making
the stream non-canonical.
The v2 stream is strict about range tombstone trimming. It emits range tombstone changes which reflect
range tombstones trimmed to query restrictions, and fast-forwarding ranges. This makes the stream
more canonical, meaning that for a given set of writes, querying the database should produce the
same stream of fragments for given restrictions. There is less ambiguity in how the writes
are represented in the fragment stream. It wasn't the case with v1. For example, a given set
of deletions could be produced either as one range_tombstone, or as many, split and/or deoverlapped
with other fragments. Making a stream canonical makes diff-calculating easier.
The mc sstable reader was converted to v2 because it seemed like a comparable effort to do that
versus implementing range tombstone trimming in v1.
The classes related to mutation fragment streams were cloned:
flat_mutation_reader_v2, mutation_fragment_v2, related concepts.
Refs #8625. To fully fix #8625 we need to finish the transition and get rid of the converters.
Converters accumulate range tombstones.
Tests:
- unit [dev]
"
* tag 'flat_mutation_reader_range_tombstone_split-v3.2' of github.com:tgrabiec/scylla: (26 commits)
tests: mutation_source_test: Run tests with conversions inserted in the middle
tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
tests: Add tests for flat_mutation_reader_v2
flat_mutation_reader: Update the doc to reflect range tombstone trimming
sstables: Switch the mx reader to flat_mutation_reader_v2
row_cache: Emit range tombstone adjacent to upper bound of population range
tests: sstables: Fix test assertions to not expect more than they should
flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments()
clustering_ranges_walker: Emit range tombstone changes while walking
tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream
Clone flat_reader_assertions into flat_reader_assertions_v2
test: lib: simple_schema: Reuse new_tombstone()
test: lib: simple_schema: Accept tombstone in delete_range()
mutation_source: Introduce make_reader_v2()
partition_snapshot_flat_reader: Trim range tombstones to query ranges
mutation_partition: Trim range tombstones to query ranges
sstables: reader: Inline specialization of sstable_mutation_reader
sstables: k_l: reader: Trim range tombstones to query ranges
clustering_ranges_walker: Introduce split_tombstone()
position_range: Introduce contains() check for ranges
...
We check for the existence of the option using one spelling,
then read it using another, so we crash with bad_lexical_cast
if it's present when casting the empty string to unsigned.
Fix by using the correct spelling.
Closes#8866
This concept was added to smoothly switch tri-comparing code from int
to strong_ordering. As of today only tests still need it and the
conversion is pretty simple, plus an operator<<(ostream&) for the
std::strong_ordering type.
* xemul/br-remove-int-or-strong-ordering-2:
util: Drop int_or_strong_ordering concept
tests: Switch total-order-check onto strong_ordering
to_string: Add formatter for strong_ordering
tests: Return strong-ordering from tri-comparators
When a new node bootstraps to join the cluster, it is set to the
bootstrap gossip status. If the node goes away in the middle, it will
be removed by gossip after it fails to update gossip within
fat_client_timeout, which reverts it to a pending node.
However, the new node may be slow to update gossip and finish
bootstrapping only after the existing nodes have already removed it
after fat_client_timeout. In the handle_state_normal handler, the
existing nodes will then fail to find the host id for the new node,
throw, and in turn terminate the scylla process.
To mitigate the problem, we clamp fat_client_timeout, which is half of
quarantine_delay, to a minimum value if users set a small ring_delay
value.
Refs #8702
Refs #8859
Closes #8860
The main difficulty was in making sure that emitted range tombstone
changes reflect range tombstones trimmed to clustering restrictions.
This is handled by mutation_fragment_filter and
clustering_ranges_walker. They return a list of range_tombstone_change
fragments to emit for each hop as the reader walks over the clustering
domain.
Tests which were using a normalizing reader expected range tombstones
to be split around rows. Drop this and adjust the tests accordingly. No
reader splits range tombstones around rows now.
Cache populating reader was emitting the row entry which stands for
the upper bound of the population range, but did not emit range
tombstones for the clustering range corresponding to:
[ before(key), after(key) ).
This surfaces after sstable readers are changed to trim emitted range
tombstones to the fast-forwarding range. Before, it didn't cause
problems, because that range tombstone part would be emitted as part
of the sstable read.
The fix is to drop the optimization which pushes the row after
population is done, and let the regular handling for
copy_from_cache_to_buffer() take care of emitting the row and
tombstones for the remaining range.
A unit test is added which covers population from all sstable
versions.
Before this patch, the tests expected readers to emit range tombstones
which are outside clustering restrictions. Readers do not have to emit
range tombstones outside clustering restrictions, so fix tests to only
expect the part which overlaps with query ranges.
This is a preparatory patch before changing readers to trim range
tombstones to clustering ranges.
This is needed to change the guarantees of flat_mutation_reader v1 to
produce only range tombstones trimmed to clustering restrictions. The
reason for this is so that v2 has a canonical representation in which
all fragments have position inside clustering restrictions. Conversion
from v1 to v2 can guarantee that only if v1 trims range tombstones.
The walker will now emit range tombstone change fragments while
walking. This is in order to support the guarantee of
flat_mutation_reader_v2 saying that clustering range tombstone
information must be trimmed to clustering key restrictions.
For example, for ranges:
[1, 3) [5, 9) [10, 11)
advancing generates the following changes:
using rtc = range_tombstone_change;
advance_to(0, {}) -> []
advance_to(2, t1) -> [ rtc(2, t1) ]
advance_to(4, t2) -> [ rtc(3, {}) ]
advance_to(15, t3) -> [ rtc(5, t2), rtc(9, {}), rtc(10, t2), rtc(11, {}) ]
Mutation sources can now produce natively either v1 or v2 streams. We
still have both v1 and v2 make_reader() variants, which wrap in
appropriate converters under the hood.
This is needed to change the guarantees of flat_mutation_reader v1 to
produce only range tombstones trimmed to clustering restrictions. The
reason for this is so that v2 has a canonical representation in which
all fragments have position inside clustering restrictions. Conversion
from v1 to v2 can guarantee that only if v1 trims range tombstones.
Current code was only selecting overlapping range tombstones.
We will need range tombstones to be trimmed. This is needed to change
the semantics of flat_mutation_reader v1 to produce only range
tombstones trimmed to clustering restrictions. This constructor is
used in unit tests which verify what reader produces.
Needed before converting the mx reader to flat_mutation_reader_v2
because now it and the k_l reader cannot share the reader
implementation. They derive from different reader impl bases and push
different fragment types.
Test that a node outside configuration, that receives `timeout_now`
message, doesn't disrupt operation of the rest of the cluster.
That is, `timeout_now` has no effect and the outsider stays in
the follower state without promoting to the candidate.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Test that a leader which has entered leader stepdown mode rejects
new append requests.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C).
Wait up until the former leader commits the new configuration and starts
leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. But at that point the target hasn't received it yet.
Deliver the `timeout_now` message to the target but lose all the
`vote_request(force)` messages it attempts to send.
This should halt the election process.
Then wait for election timeout so that candidate node starts another
normal election (without `force` flag for vote requests).
Check that this candidate further makes progress and is elected a
leader.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C).
Wait up until the former leader commits the new configuration and starts
leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. But at that point the target hasn't received it yet.
Lose this message and verify that the rest of the cluster (B, C)
can make progress and elect a new leader.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
4-node cluster (A, B, C, D). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C, D).
Communicate the cluster up to the point where A starts to resign
its leadership (calls `transfer_leadership()`).
At this point, A should send a `timeout_now` message to one of
the remaining nodes (B, C or D) and the new configuration should be
committed. But no nodes actually have received the `timeout_now` message
yet.
Determine on which node the message should arrive, accept the
`timeout_now` message and disconnect the target from the rest of the
group.
Check that after that the cluster, which has only two live members,
could progress and elect a new leader through a normal election process.
tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is needed to change the guarantees of flat_mutation_reader v1 to
produce only range tombstones trimmed to clustering restrictions. The
reason for this is so that v2 has a canonical representation in which
all fragments have position inside clustering restrictions. Conversion
from v1 to v2 can guarantee that only if v1 trims range tombstones.
The transition to v2 will be incremental. To support that, we need
adaptors between v1 and v2 which will be inserted at places which are
boundaries of conversion.
The v1 -> v2 converter needs to accumulate range tombstones, so has unbounded worst case memory footprint.
The v2 -> v1 converter trims range tombstones around clustering rows,
so generates more fragments than necessary.
Because of that, adaptors are a temporary solution and we should not
release with them on the production code paths.
When compacting a mutation fragment stream (e.g. for sstable
compaction, data query, repair), the compactor needs to accumulate
range tombstones which are relevant for the yet-to-be-processed range.
See range_tombstone_accumulator. One problem is that it has unbounded
memory footprint because the accumulator needs to keep track of all
the tombstoned ranges which are still active.
Another, although more benign, problem is computational complexity
needed to maintain that data structure.
The fix is to get rid of the overlap of range tombstones in the
mutation fragment stream. In v2 of the stream, there is no longer a
range_tombstone fragment. Deletions of ranges of rows within a given
partition are represented with range_tombstone_change fragments. At
any point in the stream there is a single active clustered
tombstone. It is initially equal to the neutral tombstone when the
stream of each partition starts. The range_tombstone_change fragment
type signifies changes of the active clustered tombstone. All fragments
emitted while a given clustered tombstone is active are affected by
that tombstone. Like with the old range_tombstone fragments, the
clustered tombstone is independent from the partition tombstone
carried in partition_start.
The v2 stream is strict about range tombstone trimming. It emits range
tombstone changes which reflect range tombstones trimmed to query
restrictions, and fast-forwarding ranges. This makes the stream more
canonical, meaning that for a given set of writes, querying the
database should produce the same stream of fragments for given
restrictions. There is less ambiguity in how the writes are
represented in the fragment stream. It wasn't the case with v1. For
example, a given set of deletions could be produced either as one
range_tombstone, or as many, split and/or deoverlapped with other
fragments. Making a stream canonical makes diff-calculating easier.
The classes related to mutation fragment streams were cloned:
flat_mutation_reader_v2, mutation_fragment_v2, and related concepts.
Refs #8625.
abstract_read_executor::make_requests() calls make_{data,digest}_request(),
which loop over endpoints in a parallel_for_each(), then collects the
result of the parallel_for_each()es with when_all_succeed(), then
a handle_exception() (or discard_result() in related callers). The
caller of make_requests then attaches a finally() block to keep `this`
alive, and discards the remaining future.
So, a lot of continuations are generated to merge the results, all in
order to keep a reference count alive.
Remove those excess continuations by having individual make_*_request()
variants elevate the reference count themselves. They all already have
a continuation to incorporate the result into the executor; all they need
is an extra shared_from_this() call. The parallel_for_each() loops
are converted to regular for loops.
Note even a local request that hits cache ends up with a non-ready future
due to an execution_stage for replica access, so these continuations
generate reactor tasks.
perf_simple_query reports:
before: median 203905.19 tps ( 87.1 allocs/op, 20.1 tasks/op, 50860 insns/op)
after: median 214646.89 tps ( 81.1 allocs/op, 15.1 tasks/op, 48604 insns/op)
There were large allocation reports from vectors used for
calculating affected ranges. In order to reduce the pressure
on the allocator, chunked vector is used for storing intermediate
results.
Instead of working only for std::vector, deoverlap is now capable
of using other structures - including chunked_vector, which will
help split large allocations into smaller ones.
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.
However, to closely simulate a production environment, we want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.
We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).
To support these use cases we first generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
We use this new functionality to tick Raft servers less often than the
network in basic_test.
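The generalized API can be sketched roughly like this (illustrative types and names, not the actual randomized_nemesis_test code): each registered function carries a period n and is invoked on every n-th tick.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch of the generalized ticker: each registered function
// has a call period n and is invoked on every n-th tick.
class ticker {
    std::vector<std::pair<unsigned, std::function<void()>>> _on_tick;
    unsigned long _ticks = 0;
public:
    void add(unsigned period, std::function<void()> f) {
        _on_tick.emplace_back(period, std::move(f));
    }
    void tick() {
        ++_ticks;
        for (auto& [period, f] : _on_tick) {
            if (_ticks % period == 0) {
                f(); // e.g. tick the network every tick, Raft servers every 10th
            }
        }
    }
};
```

With {1, tick_network} and {10, tick_servers}, 100 ticks produce 100 network ticks but only 10 server ticks, giving the reactor room to drain work between server ticks.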
This patchset effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.
The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.
The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.
* kbr/tick-network-often-v4:
raft: randomized_nemesis_test: generalize `ticker` to take a set of functions
raft: randomized_nemesis_test: split `environment::tick` into two functions
raft: randomized_nemesis_test: fix potential use-after-free in basic_test
... with associated calling periods and use the new API in `basic_test`.
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.
However, to closely simulate a production environment, we may want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.
We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).
To support these use cases we generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
We also modify `basic_test` to use this new approach: we tick Raft
servers once per 10 network ticks (in particular, once per 10 reactor
yields).
This commit effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.
The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.
The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.
With this change we must also wait a bit longer for the first node to
elect itself as a leader at the beginning of the test.
The test starts by waiting a certain number of ticks for the first node
to elect itself as a leader.
If this wait times out - i.e. the number of ticks passes before the node
manages to elect itself - the future associated with the task which checks
for the leader condition becomes discarded (it is passed to
`with_timeout`) and the task may keep using the `environment` (which it
has a reference to) even after the `environment` is destroyed.
Furthermore, the aforementioned task is a coroutine which uses lambda
captures in its body. Leaving `with_timeout` destroys the lambda object,
causing the coroutine to refer to no-longer-existing captures.
We fix the problems by:
- making `environment` `weakly_referencable` and checking if its alive
before it's used inside the task,
- not capturing anything in the lambda but passing whatever's needed as
function arguments (so these things get allocated inside the coroutine
frame).
Merged patch series by Pavel Emelyanov:
Alternator start and stop code is sitting inside the main()
and it's a big piece of code out there. Having it all in main
complicates rework of start-stop sequences, it's much more
handy to have it in alternator/.
This set puts the mentioned code into a transport- and thrift-
like controller model. While doing it, one more call to the global
storage service goes away.
* 'br-alternator-clientize' of https://github.com/xemul/scylla:
alternator: Move start-stop code into controller
alternator: Move the whole starting code into a sched group
alternator: Dont capture db, use cfg
alternator: Controller skeleton
alternator: Controller basement
alternator: Drop storage service from executor
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
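The hazard can be illustrated with a minimal sketch (hypothetical names, not the actual view-update code): when the same value must feed both a local and a remote update, the earlier branch has to copy, and only the last use may move.

```cpp
#include <string>
#include <utility>

// Hypothetical sketch of the use-after-move hazard: when an update must be
// sent both locally and remotely, only the last use may consume the value.
struct update_sinks {
    std::string local;
    std::string remote;
};

update_sinks send_update(std::string mut, bool to_local, bool to_remote) {
    update_sinks sinks;
    if (to_local && to_remote) {
        sinks.local = mut;              // copy: mut is still needed below
        sinks.remote = std::move(mut);  // last use: safe to move
    } else if (to_local) {
        sinks.local = std::move(mut);
    } else if (to_remote) {
        sinks.remote = std::move(mut);
    }
    return sinks;
}
```

The buggy shape would be `std::move(mut)` in the local branch followed by another use of `mut` for the remote branch, leaving the remote update empty.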
Fixes #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
Closes #8834
* github.com:scylladb/scylla:
view: fix use-after-move when handling view update failures
db,view: explicitly move the mutation to its helper function
db,view: pass base token by value to mutate_MV
This patch also fixes rare hangs in debug mode for drops_04 without
prevote.
Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling
Tests: unit ({dev}), unit ({debug}), unit ({release})
Changes in v2:
- Fixed commit message @kostja
Without prevote, a node disconnected for long enough becomes a candidate.
While disconnected (A) it keeps increasing its term.
When it rejoins it disrupts the current leader (C) which steps down due
to the higher term in (A)'s append_entries_reply and (C) also increases
its term.
Meanwhile followers (B) and (D) don't know (C) stepped down but see it
alive according to the current failure detector implementation, and
also (A) has shorter log than them.
So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers).
Then (C) rejects voting for (A) because it has shorter log.
And (C) becomes candidate but even though (A) votes for (C), the
previous followers (B) and (D) ignore a vote request while leader (C) is
still alive and election timeout has not passed.
(A) and (C) alone can't reach quorum 2/4. So elections never succeed.
This patch addresses this problem by making followers not ignore vote
requests from the node they believe is the current leader, even if the
election timeout has not been reached.
As @kostja noted, if the failure detector considered a leader alive only
as long as it keeps sending heartbeats (append requests), this patch
would no longer be needed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>
This patch fixes the following compilation warning:
database.cc:430:33: warning: 'update_shares_for_class' is deprecated:
Use io_priority_class.update_shares [-Wdeprecated-declarations]
_inflight_update = engine().update_shares_for_class(_io_priority,
uint32_t(shares));
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #8751
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Refs #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
* seastar 4506b878...813eee3e (12):
> reactor: fix race with boost::barrier destructor during smp initialization
> Merge "Merge io-group and io-queue configs" from Pavel E
> tests: add test for skipping data from a socket
> tests: transform socket_test into a test suite
> .gitignore: Add tags
> tls: retain handshake error and return original problem on repeated failures
> iostream: fix skipping from closed sockets
> gitignore .cooking_memory
> Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy
> io_priority_class: Make update_shares const
> Remove <seastar/core/apply.hh>
> smp: allow having multiple instances of the smp class
The fix to make io_priority::update_shares() const will allow getting
rid of one of the compilation warnings.
This patch fixes the following warnings:
service/priority_manager.cc:30:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
: _commitlog_priority(engine().register_one_priority_class("commitlog", 1000))
service/priority_manager.cc:31:35: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _mt_flush_priority(engine().register_one_priority_class("memtable_flush", 1000))
service/priority_manager.cc:32:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _streaming_priority(engine().register_one_priority_class("streaming", 200))
service/priority_manager.cc:33:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _sstable_query_read(engine().register_one_priority_class("query", 1000))
service/priority_manager.cc:34:37: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _compaction_priority(engine().register_one_priority_class("compaction", 1000))
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This patch fixes the following warning:
main.cc:307:53: warning: 'capacity' is deprecated: modern I/O queues
should use a property file [-Wdeprecated-declarations]
auto capacity = engine().get_io_queue().capacity();
It's fine to just check --max-io-requests directly because seastar
sets io_queue::capacity to the value of this parameter anyway.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in the partitioned set, which came into existence after
the compound set was introduced.
The use-after-free happens because the partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, where it is used by all selectors managed by the
compound set. So if the next position is freed while it is being used
as the current position, subsequent selectors would find the current
position freed, making them produce incorrect results.
Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.
Fixes #8802.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
To handle a read request from a client, the coordinator node must send
data and digest requests to replicas, reconcile the obtained results
(by merging the obtained mutations and comparing digests), and possibly
send more requests to replicas if the digests turned out to be different
in order to perform read repair and preserve consistency of observed reads.
In contrast to writes, where coordinators send their mutation write requests
to all replicas in the replica set, for reads the coordinators send
their requests only to as many replicas as is required to achieve
the desired CL.
For example consider RF=3 and a CL=QUORUM read. Then the coordinator sends
its request to a subset of 2 nodes out of the 3 possible replicas. The
choice of the 2-node subset is random; the distribution used for the
random roll is affected by certain things such as the "cache hitrate"
metric. The details are not that relevant for this discussion.
If not all of the initially chosen replicas
answer within a certain time period, the coordinator may send an
additional request to one more replica, hoping that this replica helps
achieving the desired CL so the entire client request succeeds. This
mechanism is called "speculative retry" and is enabled by default.
This time period - call it `T` - is chosen based on keyspace
configuration. The default value is "99.0PERCENTILE", which means that
`T` is roughly equal to the 99th percentile of the latency distribution
of previous requests (or at least the most recent requests; the
algorithm uses an exponential decay strategy to make old requests less
relevant for the metric). The latencies used are the durations of whole
coordinator read requests: each such duration measurement starts before
the first replica request is sent and ends after the last replica
request is answered, among the replica requests whose results were used
for the reconciled result returned to the client (there may be more
requests sent later "in the background" - they don't affect the client
result and are not taken into account for the latency measurement).
This strategy, however, gives an undesired effect which appears
when a significant part of all requests require a speculative retry to
succeed. To explain this effect it's best to consider a scenario which
takes this to the extreme - where *all* requests require a speculative retry.
Consider RF=3 and CL=QUORUM so each read request initially uses 2
replicas. Let {A, B, C} be the set of replicas. We run a uniformly
distributed read workload.
Initially the cluster operates normally. Roughly 1/3 of all requests go
to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th
percentile of read request latencies is 50ms. Suppose that the average
round-trip latency between a coordinator and any replica is 10ms.
Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power
outage. This means that other nodes are initially not aware that C is down,
they must wait for the failure detector to convict C as unavailable
which happens after a configurable amount of time. The current default
is 20s, meaning that by default coordinators will still attempt to send
requests to C for 20s after it is hard-killed.
During this period the following happens:
- About 2/3 of all requests - the ones which were routed to {A, C} and
{B, C} - do not finish within 50ms because C does not answer. For
these requests to finish, the coordinator performs a speculative retry
to the third replica which finishes after ~10ms (the average round-trip
latency). Thus the entire request, from the coordinator's POV, takes ~60ms.
- Eventually (very quickly in fact - assuming there are many concurrent
requests) the P99 latency rises to 60ms.
- Furthermore, the requests which initially use {A, C} and {B, C} start
making up more than 2/3 of all in-flight requests because they stay in
the foreground longer than the {A, B} requests (their latencies are higher).
- These requests do not finish within 60ms. Thus coordinators perform
speculative retries. Thus they finish after ~70ms.
- Eventually the P99 latency rises to 70ms.
- These bad requests take an even longer portion of all requests.
- These requests do not finish within 70ms. They finish after ~80ms.
- Eventually the P99 latency rises to 80ms.
- And so on.
In metrics, we observe the following:
- Latencies rise roughly linearly. They rise until they hit a certain limit;
this limit comes from the fact that `T` is upper-bounded by the
read request timeout parameter divided by 2. Thus if the read request
timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`.
Thus eventually all requests will take about `2.5s + 10ms` to finish
(`2.5s` until speculative retry happens, `10ms` for the last round-trip),
unless the node is marked as DOWN before we reach that limit.
- Throughput decreases roughly proportionally to the y = 1/x function, as
expected from Little's law.
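The cap on `T` described in the first bullet above can be sketched as (hypothetical helper name, not the actual coordinator code):

```cpp
#include <algorithm>

// Sketch of the bound: the speculative-retry delay T is the P99 latency
// estimate, but never more than half the read request timeout.
double speculative_retry_delay_ms(double p99_estimate_ms, double read_timeout_ms) {
    return std::min(p99_estimate_ms, read_timeout_ms / 2.0);
}
```

With a 5s read timeout and a 3s P99 estimate, `T` comes out at 2.5s, matching the example above.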
Everything goes back to normal when nodes mark C as DOWN, which happens
after ~20s by default as explained above. Then coordinators start
routing all requests to {A, B} only.
This does not happen for graceful shutdowns, where C announces to the
cluster that it's shutting down before shutting down, causing other
nodes to mark it as DOWN almost immediately.
The root cause of the issue is a feedback loop in the metric used to
calculate `T`: we perform a speculative retry after `T` -> P99 request
latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc.
We fix the problem by changing the measurements used for calculating
`T`. Instead of measuring the entire coordinator read latency, we
measure each replica request separately and take the maximum over these
measurements. We only take into account the measurements for requests
that actually contributed to the request's result.
The previous statistic would also measure failed requests' latencies. Now
we measure only the latencies of successful replica requests. Indeed this makes
sense for the speculative retry use case; the idea behind speculative retry
is that we assume that requests usually succeed within a certain time
period, and we should perform the retry if they take longer than that.
To measure this time period, taking failed requests into account doesn't
make much sense.
In the scenario above, for a request that initially goes to {A, C}, the
following would happen after applying the fix:
- We send the requests to A and C.
- After ~10ms A responds. We record the ~10ms measurement.
- After ~50ms we perform speculative retry, sending a request to B.
- After ~10ms B responds. We record the ~10ms measurement.
The maximum over recorded measurements is ~10ms, not ~60ms.
The feedback loop is removed.
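The changed measurement can be sketched as (a hypothetical helper, not the actual coordinator code): the sample fed to the percentile estimator is the maximum over the per-replica latencies of successful requests that contributed to the result, rather than the whole coordinator-side duration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: the latency sample fed to the P99 estimator is the
// maximum over successful, contributing replica-request latencies only.
double latency_sample_ms(const std::vector<double>& contributing_latencies_ms) {
    if (contributing_latencies_ms.empty()) {
        return 0.0; // nothing contributed; no meaningful sample to record
    }
    return *std::max_element(contributing_latencies_ms.begin(),
                             contributing_latencies_ms.end());
}
```

In the {A, C} scenario above, A's ~10ms and B's ~10ms replies yield a ~10ms sample; C's failed request is excluded, so the ~60ms coordinator-side duration never enters the estimator.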
Experiments show that the solution is effective: in scenarios like
above, after C is killed, latencies only rise slightly by a constant
amount and then maintain their level, as expected. Throughput also drops
by a constant amount and maintains its level instead of continuously
dropping with an asymptote at 0.
Fixes #3746.
Fixes #7342.
Closes #8783
This series adds a new configuration option -
restrict_replication_simplestrategy - which can be used to restrict the
ability to use SimpleStrategy in a CREATE KEYSPACE or ALTER KEYSPACE
statement. This is part of a new effort (dubbed "safe mode") to allow an
installation to restrict operations which are un-recommended or dangerous
(see issue #8586 for why SimpleStrategy is bad).
The new restrict_replication_simplestrategy option has three values:
"true", "false", and "warn":
For the time being, the default is still "false", which means SimpleStrategy is not
restricted, and can still be used freely.
Setting a value of "true" means that SimpleStrategy *is* restricted -
trying to create a keyspace with it will fail:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
ConfigurationException: SimpleStrategy replication class is not
recommended, and forbidden by the current configuration. Please use
NetworkTopologyStrategy instead. You may also override this restriction
with the restrict_replication_simplestrategy=false configuration
option.
Trying to ALTER an existing keyspace to use SimpleStrategy will
similarly fail.
The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE/ALTER KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
Warnings :
SimpleStrategy replication class is not recommended, but was used for
keyspace try1. The restrict_replication_simplestrategy configuration
option can be changed to silence this warning or make it into an error.
Fixes #8586
Closes #8765
* github.com:scylladb/scylla:
cql: create_keyspace_statement: move logger out of header file
cql: allow restricting SimpleStrategy in ALTER KEYSPACE
cql: allow restricting SimpleStrategy in CREATE KEYSPACE
config: add configuration option restrict_replication_simplestrategy
config: add "tri_mode_restriction" type of configurable value
utils/enum_option.hh: add implicit converter to the underlying enum
Move the logger declaration from the header file into the only source
file that uses it.
This is just a small cleanup similar to what the previous patch did in
alter_keyspace_statement.{cc,hh}.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch we made CREATE KEYSPACE honor the
"restrict_replication_simplestrategy" option. In this patch we do the
same for ALTER KEYSPACE.
We use the same function check_restricted_replication_strategy()
used in CREATE KEYSPACE for the logic of what to allow depending on the
configuration, and what errors or warnings to generate.
One of the non-self-explanatory changes in this patch is to execute():
Previously, alter_keyspace_statement inherited its execute() from
schema_altering_statement. Now we need to override it to check if the
operation is forbidden before running schema_altering_statement's execute()
or to warn after it is run. In the previous patch we didn't need to add
a new execute() for create_keyspace_statement because we already had one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch uses the configuration option which we added in the previous
patch, "restrict_replication_simplestrategy", to control whether a user
can use the SimpleStrategy replication strategy in a CREATE KEYSPACE
operation. The next patch will do the same for ALTER KEYSPACE.
As a tri_mode_restriction, the restrict_replication_simplestrategy option
has three values - "true", "false", and "warn":
The value "false", which today is still the default, means that
SimpleStrategy is not restricted, and can still be used freely.
The value "true" means that SimpleStrategy *is* restricted - trying to
create a keyspace with it will fail:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
ConfigurationException: SimpleStrategy replication class is not
recommended, and forbidden by the current configuration. Please use
NetworkTopologyStrategy instead. You may also override this restriction
with the restrict_replication_simplestrategy=false configuration
option.
The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
Warnings :
SimpleStrategy replication class is not recommended, but was used for
keyspace try1. The restrict_replication_simplestrategy configuration
option can be changed to silence this warning or make it into an error.
Because we plan to use the same checks and the same error messages
also for ALTER KEYSPACE (in the next patch), we encapsulate this logic in
a function check_restricted_replication_strategy() which we will use for
ALTER KEYSPACE as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a configuration option to choose whether the
SimpleStrategy replication strategy is restricted. It is a
tri_mode_restriction, allowing to restrict this strategy (true), to allow
it (false), or to just warn when it is used (warn).
After this patch, the option exists but doesn't yet do anything.
It will be used in the following two patches to restrict the
CREATE KEYSPACE and ALTER KEYSPACE operations, respectively.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new type of configurable value for our command-line
and YAML parsers - a "tri_mode_restriction" - which can be set to three
values: "true", "false", or "warn".
We will use this value type for many (but not all) of the restriction
options that we plan to start adding in the following patches.
Restriction options will allow users to ask Scylla to restrict (true),
to not restrict (false) or to warn about (warn) certain dangerous or
undesirable operations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
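The value type described above can be sketched as follows (an illustrative stand-in only; the real implementation builds on Scylla's config and enum_option machinery):

```cpp
#include <stdexcept>
#include <string>

// Illustrative tri-state restriction value parsed from "true"/"false"/"warn".
enum class tri_mode_restriction { False, True, Warn };

tri_mode_restriction parse_tri_mode(const std::string& s) {
    if (s == "true") {
        return tri_mode_restriction::True;   // operation is forbidden
    }
    if (s == "false") {
        return tri_mode_restriction::False;  // operation is allowed freely
    }
    if (s == "warn") {
        return tri_mode_restriction::Warn;   // allowed, but emit a warning
    }
    throw std::invalid_argument("expected true, false or warn: " + s);
}
```

A statement handler can then switch on the parsed value to decide between rejecting, warning, and allowing.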
Add an implicit converter of the enum_option to the underlying enum
it is holding. This is needed for using switch() on an enum_option.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
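The problem the converter solves can be sketched like this (a simplified stand-in for enum_option, not the real class): switch() requires an enum or integral operand, so a wrapper holding an enum must convert implicitly to it.

```cpp
// Simplified stand-in for enum_option: the implicit conversion operator lets
// an instance be used directly as the operand of a switch statement.
template <typename Enum>
struct enum_option_like {
    Enum _value;
    operator Enum() const { return _value; }
};

enum class mode { off, on };

int describe(enum_option_like<mode> opt) {
    switch (opt) {            // compiles only thanks to operator Enum()
    case mode::off: return 0;
    case mode::on:  return 1;
    }
    return -1;
}
```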
We check that the number of open files is sufficient for normal
work (with lots of connections and sstables), but we can improve
it a little. Systemd sets up a low file soft limit by default (so that
select() doesn't break on file descriptors larger than 1023) and
recommends[1] raising the soft limit to the more generous hard limit
if the application doesn't use select(), as ours does not.
Follow the recommendation and bump the limit. Note that this applies
only to scylla started from the command line, as systemd integration
already raises the soft limit.
[1] http://0pointer.net/blog/file-descriptor-limits.html
Closes #8756
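The recommendation boils down to this sketch (a simplified version; the actual check lives in Scylla's startup code, and the exact behavior is platform-dependent):

```cpp
#include <sys/resource.h>

// Sketch of the systemd recommendation: a process that never uses select()
// can safely raise its soft RLIMIT_NOFILE all the way to the hard limit.
bool bump_nofile_soft_limit() {
    struct rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return false;
    }
    rl.rlim_cur = rl.rlim_max;  // soft limit := hard limit
    return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}
```

Raising the soft limit up to the hard limit requires no privileges, which is why it can be done from the command-line startup path without systemd's help.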
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.
* scylla-dev/raft-learner-test-v4:
raft: (testing) test non-voter can vote
raft: (testing) test receiving a confchange in a snapshot
raft: (testing) test voter-non-voter config change loop
raft: (testing) test non-voter doesn't start election on election timeout
raft: (testing) test what happens when a learner gets TimeoutNow
raft: (testing) implement a test for a leader becoming non-voter
raft: style fix
raft: step down as a leader if converted to a non-voter
raft: improve configuration consistency checks
raft: (testing) test that non-voter stays in PIPELINE mode
raft: (testing) always return fsm_debug in create_follower()
This is a reproducer for issue #8827, that checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.
Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.
The test also prints which protocol versions worked - so it also helps
checking issue #8837 (about the ancient SSL protocol being allowed).
Refs #8837
Refs #8827
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
The recent seastar update moved the variable again, so to have
proper support for it we'd need to have 2 try-catch attempts and
a default. Or 1 try-catch, but make sure the maintainer commits
this patch AND the seastar update in one go, so that the intermediate
variable doesn't creep into an intermediate commit. Or accept that
the scylla-gdb test is not entirely bisect-safe.
Instead of making this complex choice I suggest just dropping the
volatile variable from the script altogether. This thing is actually
a constant derived from the latency goal and the io-properties.yaml
file, so it can be calculated without gdb's help (unlike run-time
bits like group rovers or numbers of queued/executing resources).
To free developers from doing all this math by hand there's an
"ioinfo" tool that (when run with the correct options) prints the
results of this math on the screen.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210610120151.1135-1-xemul@scylladb.com>
This move is not "just a move", but also includes:
- putting the whole thing into seastar::async()
- switching from locally captured dependencies to controller
class members
- making smp_service_groups optional because it doesn't have a
default constructor and must somehow survive on a constructed
controller until its start()
Also copy a few bits from main that can be generalized later:
- get_or_default() helper from main
- sharded_parameter lambda for cdc
- net family and preferred thing from main
( this also fixed the indentation broken by the previous patch )
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The controller won't have the database_config at hand to get
the sched group from. All other client services run the whole
controller start in the needed sched group, so prepare the
alternator controller for that.
To make it compile (and while-at-it) also move up the sharded
server and executor instances and the smp_service_group. All
of these will migrate onto the controller in the next patch.
( the indentation is deliberately left broken )
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When .init()ing the server one needs to provide the
max_concurrent_requests_per_shard value from config.
Instead of carrying the database around for it -- use the
db::config itself which is at hand. All the shards share
its instance anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the controller class with all the needed dependencies. For
now completely unused (thus a bunch of (void)-s here and there).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add header and source file for transport- (and thrift-) like controller
that'll do all the bookkeeping needed to start and stop this client
service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a non-voter is asked for a vote, it must vote
to preserve liveness. In Raft, servers respond
to messages without consulting with their current configuration,
and the non-voter may not have the latest configuration
when it is requested to vote.
Once a learner receives TimeoutNow it becomes a candidate, discovers it
can't vote, doesn't increase its term and converts back to a
follower. Once entries arrive from a new leader it updates its
term.
If the leader becomes a non-voter after a configuration change,
step down and become a follower.
Non-voting members are an extension to Raft, so the protocol spec does
not define whether they can be leaders. I can not think of a reason
why they can't, yet I also can not think of a reason why it's useful,
so let's forbid this.
We already do not allow non-voters to become candidates, and
they ignore timeout_now RPC (leadership transfer), so they
already can not be elected.
Isolate the checks for configuration transitions in a static function,
to be able to unit test them outside class server.
Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.
*Allow* to transfer leadership in a configuration when
the only voter is the leader itself. This would be equivalent
to syncing the leader log with the learner and converting
the leader to the follower itself. This is safe, since
the leader will re-elect itself quickly after an election timeout,
and may be used to do a rolling restart of a cluster with
only one voter.
A test case follows.
Recently, the logic of elect_new_leader was changed to allow the old
leader to vote for the new candidate. But the implementation is wrong as
it re-connects the old leader in all cases, regardless of whether the
nodes were already disconnected.
Check if both old leader and the requested new leader are connected
first and only if it is the case then the old leader can participate in
the election.
There were occasional hangs in the loop of elect_new_leader because
other nodes besides the candidate were ticked. This patch fixes the
loop by removing ticks inside of it.
The loop is needed to handle prevote corner cases (e.g. 2 nodes).
While there, also wait for the log to propagate to all followers, to
avoid a previously dropped leader becoming a dueling candidate.
Also, update _leader only if it has changed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-3-alejo.sanchez@scylladb.com>
Original code which introduced enforcing page limits for indexed
statements created a new constant for max result size in bytes.
Botond reported that we already have such a constant, so it's now
used instead of reinventing it from scratch.
Closes#8839
The query tracing tests in test/alternator's test_tracing.py had one
timeout of 30 seconds to find the trace, and one unclearly-coded timeout
for finding the right content for the trace. We recently saw both
timeouts exceeded in tests, but only rarely and only in debug mode,
in a run 100 times slower than normal.
This patch increases both timeouts to 100 seconds. Whatever happens then,
we win: If the test stops failing, we know the new timeout was enough.
If the test continues to fail, we will be able to conclude that we have a
real bug - e.g., perhaps one of the LWT operations has a bug causing it
to hang indefinitely.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210608205026.1600037-1-nyh@scylladb.com>
We're only moving the other reader, without the
other's exception (as it may already have been abandoned
or aborted).
While at it, mark the constructor noexcept.
Fixes#8833
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210609135925.270883-1-bhalevy@scylladb.com>
Both the controller and the server need database only to get the config
from it. Since controller creation only happens in main() code, which has
the config itself, we may remove the mention of database from transport.
The previous attempt was to avoid carrying the config down to the server
level, but it stepped on an updateable_value landmine -- the
updateable_value isn't copyable cross-shard (despite the docs), and to
properly initialize the server's max_concurrent_requests we need the
config's named_value member itself.
The db::config that flies through the stack is a const reference, but
its named_values do not get copied along the way -- the updateable
value accepts both references and const references to subscribe to.
tests: start-stop in debug mode
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210607135656.18522-1-xemul@scylladb.com>
The Red Hat packages were missing two things: first, the metapackage
didn't depend on the python3 package at all, and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency, which can cause some problems during upgrade.
Doing both of the things listed here is a bit of overkill, as either
one of them separately would solve the problem described in #XXXX,
but both should be applied in order to express the correct concept.
Fixes #8829
Closes #8832
When a node is removed from the _live_endpoints list directly, e.g., a
node being decommissioned, it is possible the node might not be marked
as down in gossiper::failure_detector_loop_for_node loop before the loop
exits. When the gossiper::failure_detector_loop loop starts again, the
node will not be considered, because it is not present in the
_live_endpoints list any more. As a result, the node will not be marked
as down through the gossiper::failure_detector_loop_for_node loop.
To fix, we mark the nodes that are removed from _live_endpoints
lists as down in the gossiper::failure_detector_loop loop.
Fixes #8712
Closes #8770
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.
It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().
This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.
Refs #8772
Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
* tag 'validate-first-and-last-keys-ordering-v1':
sstable: validate first and last keys ordering
test: lib: reusable_sst: save unexpected errors
test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
test: sstable_test: define primary key in schema for compressed sstable
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.
Use a coroutine to simplify error handling and
close the reader if an exception is caught.
Also, catch errors inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
there and return an exceptional future early, since
the reader will not be moved to sst->write_components, which assumes
ownership over it and closes it in all cases.
Fixes#8776
Test: unit(dev)
DTest: repair_additional_test.py:RepairAdditionalTest.repair_while_table_is_dropped_test (dev, debug) w/ https://github.com/scylladb/scylla/pull/8635#issuecomment-856661138
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8782
* github.com:scylladb/scylla:
streaming: make_streaming_consumer: close reader on errors
streaming: make_streaming_consumer: coroutinize returned function
* scylla-dev/raft-group-0-part-1-rebase:
raft: (service) pass Raft service into storage_service
raft: (service) add comments for boot steps
raft: add ordering for raft::server_address based on id
raft: (internal) simplify construction of tagged_id
raft: (internal) tagged_id minor improvements
`ring_range()`/`tokens_iterator` are more complicated than they need to be. The `include_min` parameter is not used anywhere, and `tokens_iterator` is pimplified without a good reason. Simplify that.
Closes#8805
* github.com:scylladb/scylla:
locator: token_metadata: depimplify tokens_iterator
locator: token_metadata: remove _ring_pos from tokens_iterator_impl
locator: token_metadata: remove tokens_end()
locator: token_metadata: remove `include_min` from tokens_iterator_impl
locator: token_metadata: remove the `include_min` parameter from `ring_range()`
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.
Since the Raft service depends on the query processor, and the query
processor depends on the storage service, break the dependency
loop by splitting Raft initialization into two steps: first start
an under-constructed instance of the "sharded" Raft service,
accepting an under-constructed instance of the "sharded"
query_processor, which is then passed into the storage service start
function; then load the local state of Raft groups from system
tables once the query processor starts.
Consistently abbreviate the raft_services instance as raft_svcs, as
is the convention in Scylla.
Update the tests.
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.
Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
Instead of std::runtime_error with a message that
resembles no_such_column_family, throw a
no_such_column_family given the keyspace and table uuid.
The latter can be explicitly caught and handled if needed.
Refs #8612
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210608113605.91292-1-bhalevy@scylladb.com>
This patch adds a "--ssl" option to test/cql-pytest's pytest, as well as
to the run script test/cql-pytest/run. When "test/cql-pytest/run --ssl"
is used, Scylla is started listening for encrypted connections on its
standard port (9042) - using a temporary unsigned certificate. Then, the
individual tests connect to this encrypted port using TLSv1.2 (Scylla
doesn't support earlier versions of SSL) instead of TCP.
This "--ssl" feature allows writing tests which stress various aspects of
the connection (e.g., oversized requests - see PR #8800), and then running
those tests in both TCP and SSL modes.
Fixes#8811
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210607200329.1536234-1-nyh@scylladb.com>
The methods `make_mutation_data_request`, `make_data_request`
and `make_digest_request` were marked as protected, but weren't used by
deriving classes. The "API" for deriving classes is encapsulated through
plural versions of these functions, such as `make_mutation_data_requests`
(note the "s" at the end), which send a request to a set of replicas
(rather than a single replica) but also do other important things - like
gathering statistics - hence we don't want the deriving classes to use
the singular versions directly.
Marking these singular methods as private communicates the intent more
clearly.
The off-strategy compaction is now enabled for repair based node
operations. It is not bound to repair based node operations though. It
makes sense to enable it for streaming based node operations too.
Fixes #8820
Closes #8821
This helper uses int_or_strong_ordering to facilitate vector
ordering checks. The rework is straightforward.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compound set's incremental selector isn't needed when only one set
contains sstables, which is the common case because the secondary set
will only contain data during maintenance operations.
From now on, if only primary set contains data, its selector will
be built directly without compound set's selector acting as an
interposer.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210607193651.126937-1-raphaelsc@scylladb.com>
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.
Use a coroutine to simplify error handling and
close the reader if an exception is caught.
Also, catch errors inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
there and return an exceptional future early, since
the reader will not be moved to sst->write_components, which assumes
ownership over it and closes it in all cases.
Fixes#8776
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.
The fixed value of 4096 can be subject to debate; the reasoning is as
follows: 2KiB rows, so moderately large but not huge, result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.
Fixes#8799
Tests: unit(release),
manual: observing metrics of modified index_paging_test
Closes#8814
* github.com:scylladb/scylla:
cql3: limit the transitional result size for indexed queries
cql3: return indexed pages after 1MB worth of data
cql3: limit the concurrency of indexed statements
Currently, if an append_message cannot be sent to one of the followers,
the entire io_fiber blocks, which eventually stops the replication. This
patch changes the message-sending part of io_fiber to be non-blocking.
The code adds a hash table that is used to keep track of the
append_request sending status per destination. All the remaining futures
are waited for during abort.
Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>
Currently each tick of the virtual clock immediately schedules the next one
at the end of the task queue, but this is too aggressive. If a tick
generates work that needs two tasks to be scheduled one after another,
such an implementation will make the task queue grow to infinity.
Considering that in debug mode even a ready future causes preemption, and
task queue shuffling may cause two or more ticks to be executed without
any other work done in between, it is very easy to get into such a
situation. The patch changes the virtual clock to tick only when the
shard is idle.
Message-Id: <20210606140305.2930189-1-gleb@scylladb.com>
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by the number of bytes
previously fetched. Once this number reaches 1MB, the concurrency will
not be increased in consecutive queries, to avoid overload.
Currently there's no practical limit on the resulting page size
for an indexed query, because it simply translates a page worth
of base primary keys into base rows. In order to avoid sending
pages that are too large, the result is returned after hitting a 1MB limit.
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096.
The number can be subject to debate; the reasoning is as follows:
2KiB rows, so moderately large but not huge, result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.
Fixes#8799
Tests: unit(release),
manual: observing metrics of modified index_paging_test
Put the reader_consumer declaration in flat_mutation_reader.hh
and include it instead of declaring the same `using reader_consumer`
declaration in several places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210607075020.31671-1-bhalevy@scylladb.com>
This is a set of a few cosmetic changes in dht/token. Mostly some comments and a simplification of `midpoint()`.
Closes#8803
* github.com:scylladb/scylla:
dht: token: add a comment excusing the `const bytes&` constructor
dht: token: simplify midpoint()
dht: token: add a comment to normalize()
dht: token: use {read,write}_unaligned instead of std::copy_n
dht: token-sharding: fix a typo in a comment
Too large requests are currently handled by the CQL server by
skipping them and sending back an error response.
That's however wasteful and dangerous: bogus request sizes
will force Scylla to potentially skip gigabytes of data
- and skipping is done by simply reading from the socket,
so it may result in gigabytes of wasted bandwidth.
Even if the request size is not bogus, closing the connection
forces users to adjust their request sizes, which should be done
anyway.
Originally, there was a bug in handling too large requests which
only read their headers and then left the connection in a broken,
undefined state, trying to interpret the rest of the large request
as a next CQL header. It was later fixed to skip the request, but
closing the connection is a safer thing to do.
Fixes #8798
Closes #8800
"
There's a bunch of issues with starting and stopping of cql_server with
the help of cql_controller.
fixes: #8796
tests: manual(start + stop,
start + exception on cql_set_state()
)
unit not run, they don't mess with transport controller
"
* 'br-transport-stop-fixes' of https://github.com/xemul/scylla:
transport/controller: Stop server on state change failure too
transport/controller: Rollback server start on state change failure too
transport/controller: Do not leave _server uninitialized
transport/controller: Rework try-catch into defers
_ring_pos is slightly confusing. I thought at first that it doesn't do anything
since operator== doesn't use it.
This cosmetic patch tries to improve the readability, and also removes
operator!= which is generated automatically in C++20.
Feature requests, fixes, and OOP refactor of replication_test.
Note: all known bugs and hangs are now fixed.
A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided
To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using it in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.
* alejo/raft-tests-replication-02-v3-30: (66 commits)
raft: replication test: wait for log for both index and term
raft: replication test: reset network at construction
raft: replication test: use lambda visitor for updates
raft: replication test: move structs into class
raft: replication test: move data structures to cluster class
raft: replication test: remove shared pointers
raft: replication test: move get_states() to raft_cluster
raft: replication test: test_server inside raft_cluster
raft: replication test: rpc declarative tests
raft: replication test: add wait_log
raft: replication test: add stop and reset server
raft: replication test: disconnect 2 support
raft: replication test: explicit node_id naming
raft: replication test: move definitions up
raft: replication test: no append entries support
raft: replication test: fix helper parameter
raft: replication test: stop servers out of config
raft: replication test: wait log when removing leader from configuration
raft: replication test: only manipulate servers in configuration
raft: replication test: only cancel rearm ticker for removed server
...
In this small series, I rewrite test/alternator/run to Python using the utility
functions developed for test/cql-pytest. In the future, we should do the same to
test/redis/run and test/scylla-gdb/run.
The benefit of this rewrite is less code duplication (all run scripts start
with the same duplicate code to deal with temporary directories, Scylla IP
addresses, etc.), but most importantly - future fixes we make to cql-pytest
(e.g., parameters needed to start Scylla efficiently, how to shut down
Scylla, etc.) will appear automatically in the alternator test without
needing to remember to change both.
Another benefit is that test/alternator/run will now be Python, not a shell
script. This should make it easier to integrate it into test.py (refs #6212) in
the future - if we want to.
Closes#8792
* github.com:scylladb/scylla:
test/alternator: rewrite test/alternator/run script in Python
test/cql-pytest: make test run code more general
In the last year, four new features were added to DynamoDB which we
don't yet support - Kinesis Streams, PartiQL, Contributor Insights and
Export to S3. Let's document them as missing Alternator features, and
point to the four newly-created issues about these features.
Refs #8786
Refs #8787
Refs #8788
Refs #8789
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210603125825.1179171-1-nyh@scylladb.com>
scylladb/seastar@e6463df8a0 ("smp: allow
having multiple instances of the smp class") changes the type of
seastar::smp::_qs from a unique_ptr to a regular pointer. Adjust for
that change, with a fallback to support older versions.
Closes#8784
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.
This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.
In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.
Fixes#8577
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8578
* github.com:scylladb/scylla:
commitlog: segment_manager::shutdown: abort on errors
commitlog: allocate_segment_ex: make_checked_file
std::bind() copies the bound parameters for safekeeping. Here this
includes expr, which can be quite heavyweight. Use std::ref() to
prevent copying. This is safe since the bound expression is executed
and discarded before has_supporting_index() returns.
Closes#8791
The current pull_github_pr.sh hard-codes the project name "scylladb/scylla".
Let's determine it automatically, from the git origin url.
This will allow using exactly the same script in other Scylla subprojects,
e.g., Seastar.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318142624.1794419-1-nyh@scylladb.com>
Eliminate unused includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Most Raft packets are sent very rarely, during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from their
send functions does not serve any purpose. Change raft's rpc interface
to return void for all packet types but append_request. We still want to
get a future from sending append_request for backpressure purposes: since
the replication protocol is more efficient if there is no packet loss, it
is better to pause a sender than to drop packets inside the rpc. The rpc
is still allowed to drop append_requests if overloaded.
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers.
In https://github.com/scylladb/scylla/pull/7807 we added support for
Azure instance in Scylla.
The following changes are required in order for machine-image to work:
1) fix the wrong metadata URL and update the metadata path values (the
regression was introduced in
f627fcbb0c)
2) fix the function naming used by machine-image
3) add missing functions which are required by machine-image
4) clean up unused functions
Closes#8596
The _node_ops_metrics is thread local; it is constructed when it
is first accessed.
If there are no node operations, the metrics will not be shown. To make the
metrics more consistent, initialize them during startup.
Refs #8311
Closes #8780
In commit 425e3b1182 (gossip: Introduce
direct failure detector), the call to notify_failure_detector inside ack
and ack2 msg handler was removed since there is no need to update the
old failure detector anymore. However, the timestamp for endpoit_state
is also updated inside notify_failure_detector. With the new failure
detector we still need the timestamp for endpoit_state. Otherwise, nodes
might be removed from gossip wrongly.
For example, as we saw in issue #8702:
INFO 2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2
has been silent for 5000ms, removing from gossip
To fix, update the timestamp as we do before in ack and ack2 msg
handler.
Fixes #8702
Closes #8777
Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#8773
When refactored for CDC, the properties -> extensions merge
was modified such that it did not handle _removal_ (i.e. an
extension function returning null -> no entry in the new map).
This caused certain enterprise extensions to not be able
to disable themselves.
Fixed by filtering existing extensions by property keywords.
Unit test added.
Closes#8774
In test_tracing.py::test_slow_query_log, there was what looked like
an innocent 30-second timeout, but it was in fact an 8-minute
timeout - because it started by sleeping 1 second, then 2 seconds,
then 3, ... up to 30 seconds. Such a high timeout is frustrating when
trying to debug failures in the test - which is only expected to take
2 seconds (and all of it because of an artificial timeout).
So fix the loop to stop iterating after 60 seconds (a compromise
between 30 seconds and 8 minutes...), sleeping a constant amount
between iterations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210601150631.1037158-1-nyh@scylladb.com>
Also drop a single violation in transport/server.cc. This helps
prevent dead code from piling up.
Three functions in row_cache_test that are not used in debug mode
are moved near their user, and under the same ifdef, to avoid triggering
the error.
Closes#8767
Make sure to apply the mutation under the table's _async_gate.
Fixes#8790
Test: unit(dev), view_build_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8794
Fixes#8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing an explicit per-memtable discard call. But to
ensure this works, since a memtable being flushed remains on the
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
Closes#8766
Both streaming and repair call the distributed sstables writing with
equal lambdas, each being ~30 lines of code. The only difference between
them is that repair might request offstrategy compaction for the new
sstable. Generalizing these two pieces saves lines of code and speeds up
the release/repair/row_level.o compilation by half a minute (out of twelve).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210531133113.23003-1-xemul@scylladb.com>
Instead of attempting to universally set the proper environment
necessary for tests to generate profiling data such that coverage.py can
process it, allow each Test subclass to set up the environment as needed
by the specific Test variant.
With this we now have support for all current test types, including cql,
cql-pytest and alternator tests.
Yet another modifier for `--run`, allowing running the same executable
multiple times and then generating a coverage report across all runs.
This will also be used by test.py for those test suites (cql test) which
run the same executable multiple times, with different inputs.
Another modifier for `--run`, allowing the test executable
path to be overridden. This is useful when the real test is run through a
run-script, like in the case of cql-pytest.
If set_cql_state() throws on stop, the local sharded<cql_server>
will be left unstopped and will fail the respective assertion on
its destruction.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If set_cql_state() throws, the cql_server remains started. If this
happens on start, before the controller's stop defer action is
scheduled, the destruction of the controller will fail the assertion
that checks that _server is stopped.
Effectively this is the fix for #8796.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If an exception happens after sharded<cql_server>.start(), the
controller's _server pointer is left pointing to the stopped sharded
server. This makes it impossible to start the server again (via the
API), since the check for if (_server) will always be true.
This is a continuation of the ae4d5a60 fix.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Waiting on the index alone does not guarantee correct propagation of the
leader's log. This patch adds a check of the term of the leader's last
log entry as well.
This was exposed with occasional problems with packet drops.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
We already wrote the test/cql-pytest/run script in Python in a way
it can be reusable for the other test/*/run scripts.
So this patch replaces the test/alternator/run shell script with Python
code which does the same thing (safely runs Scylla with Alternator and
pytest on it in a temporary directory and IP address), but sharing most
of the code that cql-pytest uses.
The benefit of reusing the test/cql-pytest/run.py library goes beyond
shorter code - the main benefit will be that we can't forget to fix one
of the test/*/run scripts (e.g., add more command line options or fix a
bug) when fixing another one.
To make the test/cql-pytest/run.py library reusable for running
Alternator, I needed to generalize a few things in this patch (e.g.,
the way we check and wait for Scylla to boot with the different APIs we
intend to check). There is also one bug-fix on how interrupts are
handled (they are now better guaranteed to kill pytest) - and now fixing
this bug benefits all runners using run.py (cql-pytest/run,
cql-pytest/run-cassandra and alternator/run).
In the future, we can port the runners which are still duplicate shell
scripts - test/redis/run and test/scylla-gdb/run - to Python in a
similar manner to what we did here for test/alternator/run.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Change the cql-pytest-specific run_cql_pytest() function to a more
general function to run pytest in any directory. Will be useful for
reusing the same code for other test runners (e.g., Alternator), and
is also clearer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / –skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But it is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the API level.
This patch wires the skip_flush option support to the
REST API.
Fixes#8725
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Note: I tried adding the option and calling it "skip_flush"
but I couldn't make it work with scylla-jmx, hence it's
called by the abbreviated name - "sf".
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.
It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().
While at it, change the exception type thrown when
the key in the summary is empty from runtime_error
to malformed_sstable_exception, since the function
is called from the read path and the corruption
may already be present in the sstable.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reusable_sst tries opening an sstable using
all sstable format versions in descending order.
It is expected to see "file not found" if the
actual sstable version is not the latest one.
That said, we may hit other errors if the sstable
is malformed in any way, so do not override
this kind of error if "file not found" errors
are hit after it, and return the unexpected error
instead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the test is using "first_key", "last_key"
literals for the first and last keys and expects them
to sort properly with the murmur3 partitioner.
Also it does that for all generated sstables
which is less interesting for reshape.
Use token_generation_for_current_shard to
generate random, properly ordered keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Otherwise, the primary_key will be considered as composite,
as its length does not equal 1. That hampers token calculation
when decorating the first and last keys in the summary file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move auxiliary classes connection and hash_connection out of
raft_cluster and into connected class.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move state_machine, persistence, connection, hash_connection, connected,
failure_detector, and rpc inside raft_cluster.
This commit moves declaration of class raft_cluster up.
(Minimize changed lines)
Moves apply_fn definition from state_machine to raft_cluster.
Fixes namespace in declarations
Keeps static rpc::net outside for now to keep this commit simple.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Following gleb, tomek, and kamil's suggestion, remove unnecessary use of
lw_shared_ptr.
This also solves the problem of constructing a lw_shared_ptr from a
forward declaration (connected) in a subsequent patch.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Since there are no more external users of test_server, move it to
raft_cluster and remove member access operator.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Convert rpc replication tests to declarative form.
This will enable moving remaining parts inside raft_cluster.
For test stability, add support for checking rpc config of a node
eventually changes to the expected configuration.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use vector instead of initializer_list for function helper parameter.
This is not a constructor and it complicates usage.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
As requested by @gleb-cloudious, stop servers taken out of
configuration.
Adjust other parts of code relying on all servers being active.
Remove temporary stop on rpc server.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If the leader is removed from the configuration, wait for the log first.
Remove wait_log_all for every case, as it was too broad a fix.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Only start/stop, init/start/rearm tickers, wait log, elapse_election,
run free election, check for leader, and verify servers in current
configuration.
This is necessary for having servers out of configuration not
present/stopped.
Temporarily stop a server in rpc test until we truly stop servers out of
configuration in next commit.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When changing configuration, don't pause and restart all tickers.
Only do it for the specific server(s) being removed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Only pause and restart tickers for servers in configuration.
Currently when a server is taken out it's reset and a new one is set up,
but out of configuration. @gleb-cloudious requested to have fully
stopped servers when out of configuration, until they are re-added.
This change is needed to allow that or else restart would arm tickers on
servers no longer present.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Do verifications in raft_cluster::verify().
This will enable having persisted snapshots inside the class and
de-clutter caller code.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Keep snapshots inside raft_cluster, removing this need outside.
If this is needed later, a const getter can be implemented.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Since create_server() is in raft_cluster, there's no need for
change_configuration() to pass total values anymore. Remove it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Do wait_log() for the next leader always in elect_new_leader.
Only wait log for new leader if it's connected to the old leader.
Pause and restart tickers when creating a candidate to prevent another
node from becoming a dueling candidate.
Remove pause and restart tickers around calls to elect_new_leader.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Remove the global create_raft_server() and replace with a
create_server() helper in replication_test().
This will allow not requiring the user of raft_cluster to create special
objects.
Note this does not move(apply) anymore as it's kept in raft_cluster.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add a helper to reset a server in raft_cluster.
Besides simplifying code and preventing errors, this will help move
create_raft_server logic to raft_cluster.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move tickers to raft_cluster helper class. Ticker initialization and
pause is done automatically at start_all() and stop_all().
Add temporary helpers to manage specific tickers. These might be removed
later once proper node abort and reset are implemented.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
raft_cluster at the moment only allows sequential zero-based ids.
The code was generating ids beyond this range, causing problems for
later code changes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Style.
Move definition of add_entry and add_remaining_entries with the rest of
raft_cluster definitions.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add a helper to calculate the next value number to be added after the
snapshot and the leader's initial log.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For rpc tests, use raft_cluster::disconnect() instead of the local
connected reference.
This removes connected object use outside raft_cluster.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add connectivity helpers disconnect(server, except) and connect_all()
so users of raft_cluster don't need to keep a connectivity object
pointer.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When there's a defined next leader, only wait for log propagation for
this follower.
Split wait_log() into a variant that waits for one follower and a
variant that waits for all followers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Remove log wait after adding entries. It was added to handle some debug
hangs but it is not good for testing.
There are already wait logs at proper code locations.
(e.g. elect_new_leader, partition)
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In this patch, we port validation/entities/secondary_index_test.java,
resulting in 41 tests for various aspects of secondary indexes.
Some of the original Java tests required direct access to the Cassandra
internals not available through CQL, so those tests were omitted.
In porting these tests, I uncovered 9 previously-unknown bugs in Scylla:
Refs #8600: IndexInfo system table lists MV name instead of index name
Refs #8627: Cleanly reject updates with indexed values where value > 64k
Refs #8708: Secondary index is missing partitions with only a static row
Refs #8711: Finding or filtering with an empty string with a secondary
index seems to be broken
Refs #8714: Improve error message on unsupported restriction on partition
key
Refs #8717: Recent fix accidentally broke CREATE INDEX IF NOT EXISTS
Refs #8724: Wrong error message when attempting index of UDT column with
a duration
Refs #8744: Index-creation error message wrongly refers to "map" - it can
be any collection
Refs #8745: Secondary index CREATE INDEX syntax is missing the "values"
option
These tests also provide additional reproducers for already known issues:
Refs #2203: Add support for SASI
Refs #2962: Collection column indexing
Refs #2963: Static column indexing
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Due to these bugs, 15 out of the 41 tests here currently xfail. We actually
had more failing tests, but we fixed a few of the above issues before this
patch went in, so their tests are passing at the time of this submission.
All 41 tests pass when running against Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210531112354.970028-1-nyh@scylladb.com>
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
- instead of using 'unspecified' for declaring that there's no workload
type for a particular service level, NULL is used for that purpose;
NULL is the standard way of representing lack of data
- introducing a delete marker, which accompanies NULL and makes it
possible to distinguish between wanting to forcibly reset a workload
type to unspecified and not wanting to change the previous value
- updating the tests accordingly
These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.
Tests: unit(release)
Closes #8763
When the user requests repair to be forcefully aborted, the `_abort_all_as`
abort source could be modified from multiple shards in parallel by the
`tracker::abort_all_repairs()` function, which can lead to undefined
behavior and to a crash. This commit makes sure that `_abort_all_as` is
used only from shard 0 when repair is aborted.
Fixes #8693
Closes #8734
The new process has the following differences from the Dockerfile
based image:
- Using buildah commands instead of a Dockerfile. This is more flexible
since we don't need to pack everything into a "build context" and
transfer it to the container; instead we interact with the container
as we build it.
- Using packages instead of a remote yum repository. This makes it
easy to create an image in one step (no need to create a repository,
promote, then download the packages back via yum). It means that
the image cannot be upgraded via yum, but container images are
usually just replaced with a new version.
- Build output is an OCI archive (e.g. a tarball), not a docker image
in a local repository. This means the build process can later be
integrated into ninja, since the artifact is just a file. The file
can be uploaded into a repository or made available locally with
skopeo.
- Any build mode is supported, not just release. This can be used
for quick(er) testing with dev mode.
I plan to integrate it further into the build system, but currently
this is blocked on a buildah bug [1].
[1] https://github.com/containers/buildah/issues/3262
Closes #8730
The value of a frozen collection may only be indexed (using a secondary
index) in full - it is not allowed to index only the keys for example -
"CREATE INDEX idx ON table (keys(v))" is not allowed.
The error message referred to a frozen<map>, but the problem can happen
on any frozen collection (e.g., a frozen set), not just a frozen map,
so it can be confusing to a user who used a frozen set and got an
error about a frozen map.
So this patch fixes the error message to refer to a "frozen collection".
Note that the Cassandra error message in this case is different - it
reads: "Frozen collections are immutable and must be fully indexed".
Fixes #8744.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210529094056.825117-1-nyh@scylladb.com>
Compaction manager can start tons of compactions of fully expired sstables in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.
This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.
With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.
Prior to 55a8b6e3c9, weight was released earlier in compaction, before the
last sstable was sealed, but right now there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
Fixes #8710.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>
[avi: drop now unneeded storage_service_for_tests]
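The serialization effect of releasing the weight only when the compaction fully completes can be modeled with a small RAII sketch. Names here are invented; the real compaction manager's weight accounting is more involved.

```cpp
#include <cassert>
#include <set>

// Toy registry of weights currently held by running compactions.
struct weight_tracker {
    std::multiset<int> active;
    // A compaction with the same weight as a running one must wait.
    bool try_register(int w) {
        if (active.count(w)) {
            return false;
        }
        active.insert(w);
        return true;
    }
    void release(int w) {
        auto it = active.find(w);
        if (it != active.end()) {
            active.erase(it);
        }
    }
};

// RAII guard: the weight is released only when the guard is destroyed,
// i.e. after the table-state update, not right after the data pass.
struct weight_guard {
    weight_tracker& tracker;
    int weight;
    ~weight_guard() { tracker.release(weight); }
};
```

With the weight held across the whole operation, a second expired-sstable compaction of the same weight cannot start until the first one finishes updating table state.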
It is currently possible that _memtables->add_memtable()
will throw after _memtables->clear(), leaving the memtables
list completely empty. However, we do rely on always
having at least one allocated in the memtables list
as active_memtable() references a lw_shared_ptr<memtable>
at the back of the memtables vector, and it is expected
to always be allocated via add_memtable() upon construction
and after clear().
This change moves the implementation of this convention
to memtable_list::clear() and makes the latter exception safe
by first allocating the to-be-added empty memtable and
only then clearing the vector.
Refs #8749
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210530100232.2104051-1-bhalevy@scylladb.com>
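The exception-safety pattern described above can be sketched as follows, with simplified stand-ins for Scylla's memtable_list (the real code uses lw_shared_ptr and seastar primitives):

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct memtable { /* contents elided */ };

struct memtable_list {
    // Invariant: never empty; the back element is the active memtable.
    std::vector<std::shared_ptr<memtable>> _memtables{std::make_shared<memtable>()};

    memtable& active_memtable() { return *_memtables.back(); }

    void clear() {
        // Allocate the replacement first: if this throws, _memtables is
        // left untouched and still holds the previous active memtable.
        auto fresh = std::make_shared<memtable>();
        _memtables.clear();
        // clear() keeps the vector's capacity (>= 1), so this push_back
        // cannot reallocate and cannot throw.
        _memtables.push_back(std::move(fresh));
    }
};
```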
user_defined_function_test fails sporadically in debug mode
due to lua timeout. Raise the timeout to avoid the failure, but
not so much that the test that expects a timeout becomes too slow.
Fixes #8746.
Closes #8747
This is another boring patch.
One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.
This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.
Fixes #8507
Test: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
Sometimes the cql-pytest tests run extremely slowly. This can be
a combination of running the debug build (which is naturally slow)
and a test machine which is overcommitted, or experiencing some
transient swap storm or some similar event. We don't want tests, which
we expect to be 100% reliable, to fail just because they run into
timeouts in Scylla when they run very slowly.
We already noticed this problem in the past, and increased the CQL client
timeout in conftest.py from the default of 10 seconds to 120 seconds -
the old default of 10 seconds was not enough for some long operations
(such as creating a table with multiple views) when the test ran very
slowly.
However, this only fixed the client-side timeout. We also have a bunch
of server-side timeouts, configured to all sorts of arbitrary (and
fairly small) numbers. For example, the server has a "write request
timeout" option, which defaults to just 2 seconds. We recently saw
this timeout exceeded in a slow run which tried to do a very large
write.
So this patch configures all the configurable server-side timeouts we
have to default to 300 seconds. This should be more than enough for even
the slowest runs (famous last words...). This default is not a good idea
on real multi-node clusters which are expected to deal with node loss,
but this is not the case in cql-pytest.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210529213648.856503-1-nyh@scylladb.com>
"
Right now storage service is used as a "provider" of other
services -- database, feature service and tokens. This set
unexports the first pair. This drops a bunch of calls for
global storage service instances from the places that don't
really need it.
tests: unit(dev), start-stop
"
* 'br-pupate-storage-service' of https://github.com/xemul/scylla:
storage-service: Don't export features
api: Get features from proxy
storage-service: Don't export database
storage-service: Turn some global helpers into methods
storage-service: Open-code simple config getters
view: Get database from storage_proxy
main: Use local database instance
api: Use database from http_ctx
Now storage service uses the feature service instance internally
and doesn't need to provide getter for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reset_local_schema call needs proxy and feature service to do its
job. Right now the features are retrieved from the global storage service,
but they are present on the proxy as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now storage service uses the database instance internally and
doesn't need to provide getter for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two static helpers used by storage service that grab
global storage service. To simplify, turn these two into
storage service methods and use 'this' inside.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two db::config getters in storage_service.cc that
are used only once. Both call for global storage service, but
since they are called from storage service it's simpler to break
this loop and make storage service get needed config options
directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The db::view code already uses proxy rather actively, so instead of
depending on the storage service to be at hands it's better to make
db::view require the proxy. For now -- via global instance.
There's one dependency on storage service left after this patch --
to get the tokens. This piece is to be fixed later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All start-stop code in main has the sharded<database> at hand, so there's
no need to get it from the global storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of getting database from global storage service it's simpler
and better to grab it from the http context at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Consider the following procedure:
- n1, n2, n3
- n3 is network partitioned from the cluster
- n4 replaces n3
- n3 has the network partition fixed
- n1 learns n3 as NORMAL status and calls
storage_service::handle_state_normal which in turn calls
update_peer_info, all columns except tokens column in system.peers are
written
- n1 restarts before figuring out that n4 is the new owner and deleting
the entry for n3 in system.peers
- n3 is removed from gossip by all the nodes in the cluster
automatically because they detect the collision and remove n3
- n1 restarts, leaving the entry in system.peers for n3 forever
To fix, we can update peer tables only if the node is part of the ring.
Fixes #8729
Closes #8742
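The rule the fix enforces can be sketched as follows. Names here are illustrative; the real handle_state_normal works on gossip endpoint states and token_metadata.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model: peer info is written to system.peers only for endpoints
// that actually own tokens in the ring, so a stale NORMAL state from a
// replaced node cannot leave a permanent orphan entry behind.
struct ring_state {
    std::set<std::string> token_owners;        // endpoints in the ring
    std::map<std::string, std::string> peers;  // stand-in for system.peers

    void handle_state_normal(const std::string& ep, const std::string& info) {
        if (!token_owners.count(ep)) {
            return;   // not part of the ring: skip the peer-table update
        }
        peers[ep] = info;
    }
};
```

In the scenario above, n1 would ignore the stale NORMAL state from n3 because n4 already owns n3's tokens.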
> Merge "memory: optimize thread-local initialization" from Avi
> Merge "Move priority classes manipulations from io-queue" from Pavel E
> gate: add default move assignment operator
Starting from seastar commit 5dae0cf3c48159990f51e5d38495af5ae224c2f8
all the registered classes info was moved into io_priority_class::_infos
array.
tests: scylla-gdb(release, old and new seastars)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210528083941.27990-1-xemul@scylladb.com>
Clang warns when "return std::move(x)" is needed to elide a copy,
but the call to std::move() is missing. We disabled the warning during
the migration to clang. This patch re-enables the warning and fixes
the places it points out, usually by adding std::move() and in one
place by converting the returned variable from a reference to a local,
so normal copy elision can take place.
Closes #8739
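The canonical shape of the warning looks like this (a generic illustration, not code from the patch):

```cpp
#include <cassert>
#include <string>
#include <utility>

// A named rvalue reference is an lvalue, so under pre-C++20 rules
// `return s;` here copies the string, and clang's -Wreturn-std-move
// flags it; adding std::move() makes the move explicit.
inline std::string take(std::string&& s) {
    return std::move(s);
}
```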
When destroying a perf_sstable_test_env, an assert in the sstables_manager
destructor fails, because it hasn't been closed.
Fix by removing all references to sstables from perf_sstable_test_env,
and then closing the test_env (as well as the sstables_manager).
Fixes #8736
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #8737
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.
---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in a new keyspace that uses EverywhereStrategy so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, a restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
Fixes #7961.
---
For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.
dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally
Closes #8643
* github.com:scylladb/scylla:
docs: update cdc.md with info about the new internal table
sys_dist_ks: don't create old CDC generations table on service initialization
sys_dist_ks: rename all_tables() to ensured_tables()
cdc: when creating new generations, use format v2 if possible
main: pass feature_service to cdc::generation_service
gms: introduce CDC_GENERATIONS_V2 feature
cdc: introduce retrieve_generation_data
test: cdc: include new generations table in permissions test
sys_dist_ks: increase timeout for create_cdc_desc
sys_dist_ks: new table for exchanging CDC generations
tree-wide: introduce cdc::generation_id_v2
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.
The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads care about the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish.
It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.
Closes #8680
* github.com:scylladb/scylla:
test: add a case for conflicting workload types
cql-pytest: add basic tests for service level workload types
docs: describe workload types for service levels
sys_dist_ks: fix redundant parsing in get_service_level
sys_dist_ks: make get_service_level exception-safe
transport: start shedding requests during potential overload
client_state: hook workload type from service levels
cql3: add listing service level workload type
cql3: add persisting service level workload type
qos: add workload_type service level parameter
It is forbidden to create a secondary index on a column which includes in
any way the "duration" type. This includes a UDT which includes a duration.
Our code attempted to print in this case the message "Secondary indexes
are not supported on UDTs containing durations" - but because we tested
for tuples first, and UDTs are also tuples - we got the message about
tuples.
By changing the order of the tests, we get the most specific (and
useful) error message.
Fixes #8724.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526201042.642550-1-nyh@scylladb.com>
The routine used for getting service level information already
operates on the service level name, but the same information is
also parsed once more from a row from an internal table.
This parsing is redundant, so it's hereby removed.
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.
Ref: #2795
tests: unit(dev), perf_sstable(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
This commit implements the following overload prevention heuristics:
if the admission queue becomes full, a timer is armed for 50ms.
If any of the ongoing requests finishes, the timer is disarmed,
but if that doesn't happen, the server goes into shedding mode,
which means that it reads new requests from the socket and immediately
drops them until one of the ongoing requests finishes.
This heuristic is not recommended for OLAP workloads,
so it is applied only if the session declared itself as
interactive (via service level's workload_type parameter).
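The state transitions of this heuristic can be modeled as follows. This is a toy sketch with invented names; the real implementation uses a seastar timer and per-connection state.

```cpp
#include <cassert>

// Toy model of the overload-shedding state machine: when the admission
// queue fills, a grace period starts; if no request completes within it,
// the server enters shedding mode until some request finishes.
struct shedding_state {
    bool timer_armed = false;
    bool shedding = false;

    void on_queue_full() {
        timer_armed = true;        // real code: arm a 50ms timer
    }
    void on_request_finished() {
        timer_armed = false;       // disarm: pressure is draining
        shedding = false;
    }
    void on_timer_fired() {
        if (timer_armed) {
            shedding = true;       // grace period elapsed with no progress
        }
    }
    // Only sessions that declared themselves interactive are shed.
    bool should_drop(bool interactive) const {
        return shedding && interactive;
    }
};
```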
The workload type is currently one of three values:
- unspecified
- interactive
- batch
By defining the workload type, the service level makes it easier
for other components to decide what to do in overload scenarios.
E.g. if the workload is interactive, requests can be shed earlier,
while if it's batched (or unspecified), shedding does not take place.
Conversely, batch workloads could accept long full scan operations.
Some subclasses want to maintain state, which constness needlessly precludes.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8721
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.
The checking code was correct, but in the wrong place: we need to first
check maybe the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.
This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
cassandra_tests/validation/entities/secondary_index_test.py::
testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.
Fixes #8717.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
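The check-ordering fix can be sketched as follows. Names are illustrative; the real code operates on CQL schema statements and materialized-view names.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Toy catalog: the "index already exists" + IF NOT EXISTS short-circuit
// must run before the view-name collision check, otherwise
// CREATE INDEX IF NOT EXISTS on an existing index fails spuriously.
struct index_catalog {
    std::set<std::string> indexes;
    std::set<std::string> taken_view_names;

    void create_index(const std::string& name, bool if_not_exists) {
        if (indexes.count(name)) {
            if (if_not_exists) {
                return;   // no-op, as CQL requires
            }
            throw std::runtime_error("index already exists");
        }
        // Only now check whether the backing view name is taken.
        if (taken_view_names.count(name + "_index")) {
            throw std::runtime_error("view name for index is already taken");
        }
        indexes.insert(name);
        taken_view_names.insert(name + "_index");
    }
};
```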
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.
This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for some time, off-strategy compaction
will be triggered for that table automatically.
Fixes #8677
Closes #8678
This warning triggers when a range for ("for (auto x : range)") causes
non-trivial copies, prompting the developer to replace with a capture
by reference. A few minor violations in the test suite are corrected.
Closes #8699
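The class of fix looks like this in isolation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Iterating a container of non-trivially-copyable elements by value
// copies each one; clang's range-loop analysis suggests a reference.
inline std::size_t total_length(const std::vector<std::string>& names) {
    std::size_t n = 0;
    for (const auto& name : names) {   // was: `for (auto name : names)` — copies
        n += name.size();
    }
    return n;
}
```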
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations is not enabled by default, we should
respect the flag to use stream to sync data if the flag is false.
Fixes #8700
Closes #8701
* github.com:scylladb/scylla:
storage_service: Add removenode_add_ranges helper
storage_service: Respect --enable-repair-based-node-ops flag during removenode
Fixes #8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below the threshold, yet still get a disk _footprint_ that is over the limit, causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
Also fix an edge case (for tests), when we have too few segments to have an active one (i.e. we need to flush everything).
Closes #8695
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
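Points 1 and 3 above boil down to a single predicate. A minimal sketch, with an illustrative segment size and function name (not Scylla's actual values or API):

```python
SEGMENT_SIZE = 32 * 1024 * 1024  # illustrative; not Scylla's actual value

def should_flush(active_bytes, wasted_bytes, max_disk_bytes,
                 segment_size=SEGMENT_SIZE):
    # Point 1: the footprint counts wasted (slack) space in partially
    # filled segments, not just live data on disk.
    footprint = active_bytes + wasted_bytes
    # Point 3: start flushing once allocation is down to the last
    # available segment, so a fresh segment is hopefully recycled by
    # the time the hard limit is hit.
    return footprint + segment_size > max_disk_bytes
```

Note how the slack term changes the answer: the same amount of live data can be over or under the threshold depending on how much of each segment is wasted.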
The old table won't be created in clusters that are bootstrapped after
this commit. It will stay in clusters that were upgraded from a version
before this commit.
Note that a fully upgraded cluster doesn't automatically create a new
generation in the new format. Even if the last generation was created
before the upgrade, the cluster will keep using it.
A new generation will be created in the new format when either:
1. a new node bootstraps (in the new version),
2. or the user runs checkAndRepairCdcStreams, which has a new check: if
the current generation uses the old format, the command will decide
that repair is needed, even if the generation is completely fine
otherwise (also in the new version).
During upgrade, while the CDC_GENERATIONS_V2 feature is still not
enabled, the user may still bootstrap a node in the old version of
Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In
that case a new generation will be created in the old format,
using the old table definitions.
The static function `all_tables` in system_distributed_keyspace.cc was
used by the `system_distributed_keyspace` service initialization
function (`start()`) to ensure that a certain set of tables - which the
service provides accessors to - exist in the cluster. For each table in
the vector returned by `all_tables()` the function would try to create
the table, ignoring the "table already exists" error if it is thrown.
The commit renames `all_tables` to `ensured_tables` to better convey the
intention of this function and documents its purpose in a comment.
We do this because in the future the service may provide accessors to
tables which it *does not* actually create. The example - coming in a
later commit - is a table which was created in a previous version of
Scylla, and for which we still have to provide accessors for backward
compatibility / correct handling of the upgrade procedure, but which we
do not want to create in clusters that were freshly created using the
new version of Scylla, since in that case these tables would be just
unnecessary garbage. We mention this use case in the comment.
A node with this commit, when creating a new CDC generation (during
bootstrap, upgrade, or when running checkAndRepairCdcStreams command)
will check for the CDC_GENERATIONS_V2 feature and:
- If the feature is enabled create the generation in the v2 format
and insert it into the new internal table. This is safe because
a node joins the feature only if it understands the new format.
- Otherwise create it in the v1 format, limiting its size as before,
and insert it into the old table.
The second case should only happen if we perform bootstrap or run
checkAndRepairCdcStreams in the middle of an upgrade procedure. On fully
upgraded clusters the feature shall be enabled, causing all new
generations to use the new format.
When a node joins this feature (which it does immediately when upgrading
to a version that has this commit), it says: "I understand the new
generation storage format and the new identifier format". Thus, when the
feature becomes enabled - after all nodes have joined it - it means that
it's safe to create new generations using these new storage/ID formats.
This function, given a generation ID, retrieves its data from the internal
table in which the data resides. Which table that is depends on the version of the ID:
for _v1 we're using system_distributed.cdc_generation_descriptions, for
_v2 we're using the better
system_distributed_v2.cdc_generation_descriptions_v2 (see the previous
commit for detailed explanation of the superiority of the new table).
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in the `system_distributed_everywhere` keyspace so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, a restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID. The timestamp and the UUID form the
"generation identifier" of this new generation; this should explain
why we introduced the generation_id_v2 type in previous commits.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
Note that the node is still using the old method - the actual switch
will be done in a later commit.
Refs #8270
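The key reordering described above (insert the data under a random UUID first, choose the timestamp second) can be sketched as follows. The function and parameter names are invented for illustration; this is not Scylla's code.

```python
import time
import uuid

def introduce_generation_v2(insert_data, gossip, ring_delay_s):
    """Sketch of the reordered protocol: write the generation data under
    a random UUID *first*, then pick the timestamp, so a slow insert no
    longer eats into the 2 * ring_delay grace period before the
    timestamp takes effect."""
    gen_uuid = uuid.uuid4()
    insert_data(gen_uuid)                        # CL=ALL into the new table
    timestamp = time.time() + 2 * ring_delay_s   # chosen *after* the insert
    gossip(timestamp, gen_uuid)                  # (timestamp, uuid) = v2 id
    return timestamp, gen_uuid
```

Compare with the old scheme, where the timestamp doubled as the partition key and therefore had to be fixed before the (possibly slow) insert even started.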
Since segment allocation looks at the actual disk footprint, not active
usage, the threshold check in the timer task should include slack space
so we don't mistake sparse usage for space left.
Refs #8270
Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of halfway through. This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle under heavy usage.
Currently the pending (memtable) flush stats are adjusted back
only on success, therefore they will "leak" on error; use
a .then_wrapped clause to always update the stats.
Note that _commitlog->discard_completed_segments is still called
only on success, and so is returning the previous_flush future.
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>
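The `.then_wrapped` fix is the continuation-style equivalent of a try/finally: the counter must come back down on both paths, while success-only side effects stay on the success path. A minimal Python sketch (names invented for the example, not Scylla's API):

```python
class FlushStats:
    def __init__(self):
        self.pending_flushes = 0

def flush_memtable(stats, do_flush, on_success=lambda: None):
    """Sketch: the pending counter is adjusted back on both success and
    failure (what .then_wrapped() achieves in the patch); try/finally is
    the Python analogue. Success-only side effects, like
    discard_completed_segments(), stay on the success path."""
    stats.pending_flushes += 1
    try:
        result = do_flush()
        on_success()      # e.g. _commitlog->discard_completed_segments()
        return result
    finally:
        stats.pending_flushes -= 1   # never "leaks", even on error
```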
Currently, advance_and_wait() allocates a new gate
which might fail. Rather than returning this failure
as an exceptional future - which will require its callers
to handle that failure, keep the function as noexcept and
let an exception from make_lw_shared<gate>() terminate the program.
This makes the function "fail-free" to its callers,
in particular, when called from the table::stop() path where
we can't do much about these failures and we require close/stop
functions to always succeed.
The alternative of making the allocation of a new gate optional
and recovering from it in start() is possible but was deemed not
worth it, as it would add complexity and cost to start(), which is
called on the common, hot path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occasionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.
Ref #8613
Closes #8692
In https://github.com/scylladb/scylla/issues/8609,
table::stop() that is called from database::drop_column_family
is expected to wait on outstanding flushes by calling
_memtable->request_flush(), but the memtable_list is considered
empty() at this point as it has a single empty memtable,
so request_flush() returns a ready future, without waiting
on outstanding flushes. This change replaces the call to
request_flush with flush().
Fix that by either returning the _flush_coalescing future
that resolves when the memtable is sealed, if available,
or by going through the get_flush_permit and
_dirty_memory_manager->flush_one song and dance, even though
the memtable is empty(), as the latter waits on pending flushes.
Fixes #8609
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524143438.1056014-1-bhalevy@scylladb.com>
The algorithm used in `get_address_ranges` and `get_range_addresses`
calls `calculate_natural_endpoints` in a loop; the loop iterates over
all tokens in the token ring. If the complexity of a particular
implementation of `calculate_natural_endpoints` is large - say `θ(n)`,
where `n` is the number of tokens - this results in an `θ(n^2)`
algorithm (or worse). This case happens for `Everywhere` replication strategy.
For small clusters this doesn't matter that much, but if `n` is, say, `20*255`,
this may result in huge reactor stalls, as observed in practice.
We avoid these stalls by inserting tactical yields. We hope that
some day someone actually implements a subquadratic algorithm here.
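A "tactical yield" just means periodically ceding control to the scheduler inside a long loop, bounding the length of any single stall. A minimal sketch using Python's asyncio in place of Seastar's reactor (function names are illustrative):

```python
import asyncio

async def for_each_token(tokens, visit, yield_every=256):
    """Sketch of the tactical yield: the potentially quadratic loop over
    the token ring periodically gives the scheduler (the reactor, in
    Scylla) a chance to run other tasks."""
    for i, token in enumerate(tokens):
        visit(token)                   # e.g. calculate_natural_endpoints()
        if (i + 1) % yield_every == 0:
            await asyncio.sleep(0)     # cede control; resume next tick
```

The work done is unchanged (still quadratic overall); only the maximum uninterrupted run between yields is bounded.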
The commit also adds a comment on
`abstract_replication_strategy::calculate_natural_endpoints` explaining
that the interface does not give a complexity guarantee (at this point);
the different implementations have different complexities.
For example, `Everywhere` implementation always iterates over all tokens
in the token ring, so it has `θ(n)` worst and best case complexity.
On the other hand, `NetworkTopologyStrategy` implementation usually
finishes after visiting a small part of the token ring (specifically,
as soon as it finds a token for each node in the ring) and performs
a constant number of operations for each visited token on average,
but theoretically its worst case complexity is actually `O(n + k^2)`,
where `n` is the number of all tokens and `k` is the number of endpoints
(the `k^2` appears since for each endpoint we must perform finds and
inserts on `unordered_set` of size `O(k)`; `unordered_set` operations
have `O(1)` average complexity but `O(size of the set)` worst case
complexity).
Therefore it's not easy to put any complexity guarantee in the interface
at this point. Instead, we say that:
- some implementations may yield - if their complexities force us to do so
- but in general, there is no guarantee that the implementation may
yield - e.g. the `Everywhere` implementation does not yield.
Fixes#8555.
Closes #8647
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
Fixes #8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations is not enabled by default, we should
respect the flag to use stream to sync data if the flag is false.
Fixes #8700
This is a new type of CDC generation identifier. Compared to old IDs,
in addition to the timestamp it contains a UUID.
These new identifiers will allow a safer and more efficient algorithm of
introducing new generations into a cluster (introduced in a later commit).
For now, nodes keep using the old identifier format when creating new
generations and whenever they learn about a new CDC generation from gossip
they assume that it also is stored in the v1 format. But they do know how
to (de)serialize the second format and how to persist new identifiers in
local tables.
config_file.cc instantiates std::istream& std::operator>>(std::istream&,
std::unordered_map<seastar::sstring, seastar::sstring>&), but that
instantiation is ignored since config_file_impl.hh specializes
that signature. -Winstantiation-after-specialization warns about it,
so re-enable it now that the code base is clean.
Also remove the matching "extern template" declaration, which has no
definition any more.
Closes #8696
* seastar 28dddd2683...f0f28d07e1 (4):
> httpd: allow handler to not read an empty content
Fixes #8691.
> compat: source_location: implement if no std or experimental are available
> compat: source_location: declare using in seastar::compat namespace
> perftune.py: fix a bug in mlx4 IRQs names matching pattern
To keep our cql-pytest tests "correct", we should strive for them to pass on
Cassandra - unless they are testing a Scylla-only feature or a deliberate
difference between Scylla and Cassandra - in which case they should be marked
"scylla-only" and cause such tests to be skipped when running on Cassandra.
The following few small patches fix a few cases where our tests were failing on
Cassandra. In one case this even found a bug in the test (a trivial Python
mistake, but still).
Closes #8694
* github.com:scylladb/scylla:
test/cql-pytest: fix python mistake in an xfailing test
test/cql-pytest: mark some tests with scylla-only
test/cql-pytest: clean up test_create_large_static_cells_and_rows
The cql3 layer manipulates lists as `std::vector`s (of `managed_bytes_opt`). Since lists can be arbitrarily large, let's use chunked vectors there to prevent potentially large contiguous allocations.
Closes #8668
* github.com:scylladb/scylla:
cql3: change the internal type of tuples::in_value from std::vector to chunked_vector
cql3: change the internal type of lists::value from std::vector to chunked_vector
cql3: in multi_item_terminal, return the vector of items by value
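The chunked_vector idea the commits above rely on can be illustrated with a toy container: elements live in fixed-size chunks, so growing never requires one huge contiguous allocation. This is a Python illustration of the data structure, not Scylla's C++ `utils::chunked_vector`:

```python
class ChunkedVector:
    """Toy chunked vector: appends allocate at most one small chunk at
    a time, avoiding large contiguous allocations for big lists."""

    def __init__(self, chunk_size=128):
        self.chunk_size = chunk_size
        self.chunks = []
        self.size = 0

    def append(self, value):
        if self.size % self.chunk_size == 0:
            self.chunks.append([])          # allocate one small chunk
        self.chunks[-1].append(value)
        self.size += 1

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.chunks[i // self.chunk_size][i % self.chunk_size]

    def __len__(self):
        return self.size
```

The trade-off is an extra indirection on element access in exchange for bounded allocation sizes, which matters under a fragmentation-sensitive allocator.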
This series fixes a minor validation issue with service level timeouts - negative values were not checked. This bug is benign because negative timeouts act just like a 0s timeout, but the original series claimed to validate against negative values, so it's hereby fixed.
More importantly however, this series follows by enabling cql-pytest to run service level tests and provides a first batch of them, including a missing test case for negative timeouts.
The idea is similar to what we already have in alternator test suite - authentication is unconditionally enabled, which doesn't affect any existing tests, but at the same time allows writing test cases which rely on authentication - e.g. service levels.
Closes #8645
* github.com:scylladb/scylla:
cql-pytest: introduce service level test suite
cql-pytest: add enabling authentication by default
qos: fix validating service level timeouts for negative values
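The missing negative-value check is a one-liner. A sketch of the validation, with an invented function name (the actual Scylla validation lives in the qos code and differs in shape):

```python
def validate_timeout_ms(timeout_ms):
    """Sketch of the fixed check: service level timeouts must be
    non-negative. Before the fix, a negative value slipped through and
    merely behaved like a 0s timeout instead of being rejected."""
    if timeout_ms is not None and timeout_ms < 0:
        raise ValueError(f"invalid timeout: {timeout_ms}ms; must be >= 0")
    return timeout_ms
```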
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).
The test suite currently consists of the following
test-cases:
* Loading server instance with configuration from a snapshot.
* Loading server instance with configuration from a log.
* Configuration changes (remove + add node).
* Leader elections don't lead to RPC configuration changes.
* Voter <-> learner node transitions also don't change RPC
configuration.
* Reverting uncommitted configuration changes updates
RPC configuration accordingly (two cases: revert to
snapshot config or committed state from the log).
A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.
Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.
This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.
It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.
* manmanson/raft-rpc-tests-v9-alt3:
raft: add tests for RPC module
test: add CHECK_EVENTUALLY_EQUAL utility macro
raft: replication_test: reset test rpc network between test runs
raft: replication_test: extract tickers initialization into a separate func
raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
raft: replication_test: introduce `test_server` aggregate struct
raft: replication_test: support voter<->learner configuration changes
raft: remove duplicate `create_command` function from `replication_test`
raft: avoid 'using' statements in raft testing helpers header
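The CHECK_EVENTUALLY_EQUAL macro named in the commit list above embodies a common testing pattern: retry a condition until it holds or a timeout expires, instead of asserting a racy value exactly once. A sketch of the idea in Python (the real macro is C++ and its exact semantics may differ):

```python
import time

def check_eventually_equal(fn, expected, timeout=1.0, poll=0.005):
    """Retry fn() until it returns `expected` or `timeout` seconds
    elapse; useful for asynchronously-updated state in tests."""
    deadline = time.monotonic() + timeout
    while True:
        value = fn()
        if value == expected:
            return True
        if time.monotonic() >= deadline:
            raise AssertionError(f"expected {expected!r}, last saw {value!r}")
        time.sleep(poll)
```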
This patchset adds the missing features noted by the patchset
introducing it, namely:
* The ability to run a test through `coverage.py`, automating the entire
process of setting up the environment, running the test and generating
the report. This is possible with the new `--run` command line
argument. It supports either generating a report immediately after
running the provided test or just doing the running part, allowing the
user to generate the report after having run all the tests they wanted
to.
* A tweakable verbosity level.
It is also possible to specify a subset of the profiling data as input
for the report.
The documentation was also completed, with examples for all the
intended use-cases.
With these changes, `coverage.py` is considered mature, the remaining
rough edges being located in other scripts (`tests.py` and
`configure.py`).
It is now possible to generate a coverage report for any test desired.
Also on: https://github.com/denesb/scylla.git
coverage-py-missing-features/v1
Botond Dénes (5):
scripts/coverage.py: allow specifying the input files to generate the
report from
scripts/coverage.py: add capability of running a test directly
scripts/coverage.py: add --verbose parameter
scripts/coverage.py: document intended uses-cases
HACKING.md: redirect to ./coverage.py for more details
scripts/coverage.py | 143 +++++++++++++++++++++++++++++++++++++++-----
HACKING.md | 19 +-----
2 files changed, 129 insertions(+), 33 deletions(-)
"
The patch set is an assorted collection of header cleanups, e.g.:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
This is a follow up change to #8512.
Let's add the aio conf file during the Scylla installation process and make
sure we also remove this file when uninstalling Scylla.
As per Avi Kivity's suggestion, let's set the aio value as static
configuration, and make it large enough to work with 500 cpus.
Closes #8650
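A sketch of what such a static aio configuration could look like. The file path and the numeric value here are placeholders for illustration, not necessarily what the installer writes:

```
# /etc/sysctl.d/99-scylla-aio.conf (illustrative path and value)
# Static fs.aio-max-nr setting, sized generously so machines with
# up to ~500 cpus do not run out of kernel aio contexts.
fs.aio-max-nr = 30000000
```

Shipping this as a sysctl.d fragment (rather than setting it dynamically at service start) means the value survives reboots and is removed cleanly on uninstall.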
Currently, var-lib-scylla.mount may fail because it can start before
the MDRAID volume is initialized.
We might be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait
for the device to become available, but the systemd manual says it
automatically configures the dependency for a mount unit when we specify
the filesystem by the "absolute path of a device node".
So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>.
Fixes #8279
Closes #8681
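A sketch of the resulting mount unit, with `<uuid>` left as a placeholder and the filesystem type assumed for illustration:

```
# var-lib-scylla.mount (sketch)
[Mount]
# Before: What=UUID=<uuid> -- systemd cannot derive an implicit device
# dependency from this form, so the mount could start too early.
# After: an absolute device-node path makes systemd add the dependency
# on the backing device unit automatically.
What=/dev/disk/by-uuid/<uuid>
Where=/var/lib/scylla
Type=xfs
```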
The xfailing test cassandra_tests/validation/entities/collections_test.py::
testSelectionOfEmptyCollections had a Python mistake (using {} instead
of set() for an empty set), which resulted in its failure when run
against Cassandra. After this patch it passes on Cassandra and fails on
Scylla - as expected (this is why it is marked xfail).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Tests which are known to test a Scylla-only feature (such as CDC)
or to rely on a known, deliberate difference between Scylla and Cassandra
should be marked "scylla-only", so they are skipped when running
the tests against Cassandra (test/cql-pytest/run-cassandra) instead
of reporting errors.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_create_large_static_cells_and_rows had its own
implementation of "nodetool flush" using Scylla's REST API.
Now that we have a nodetool.flush() function for general use in
cql-pytest, let's use it and save a bit of duplication.
Another benefit is that now this test can be run (and pass) against
Cassandra.
To allow this test to run on Cassandra, I had to remove a
"USING TIMEOUT" which wasn't necessary for this test, and is
not a feature supported by Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch every failure to pull the configuration was
reported as a warning. However this is confusing for users for two
reasons:
1. It pollutes the logs when the configuration is polled, which is Scylla's
mode of operation. Such a line is logged on every failed iteration.
2. It confuses users because even though this level is warning, it logs
out an exception and the log message contains the word failed.
We see it a lot during QA runs and customer questions from the field.
Point 2 is only solvable by reducing the verbosity of the logged
information, which will make debugging harder.
Point 1 is addressed here in the following manner. First, the
one-shot configuration pull function no longer handles the exception
itself. This is OK because it is harmless to fail once or twice in a
row when pulling configuration, just like in any other query; the
caller is the one responsible for handling the exception and logging
the information. Second, the polling loop captures the exceptions
thrown from the configuration pulling function and only reports an
error with the latest exception if the polling has failed in
consecutive iterations over the last 90 seconds. This value was chosen
because it is about the empirical worst-case time it takes for a node
to notice that one of the other nodes in the cluster is down (and
hence stop querying it).
It is not important for the user or for us to be notified of temporary
glitches in availability (through this error at least), and since we
are eventually consistent it is OK that some nodes catch up with the
configuration later than others.
We also set a threshold in which if the configuration still couldn't be
retrieved then the logging level is bumped to ERROR.
Closes #8574
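The "only log after 90 seconds of consecutive failures" policy reduces to tracking when the current failure streak started. A minimal sketch with invented names and an injectable clock:

```python
import time

class PullErrorReporter:
    """Sketch of the new logging policy: individual pull failures are
    not logged as warnings; an error is reported only once pulls have
    been failing continuously for `window` seconds (90s in the patch)."""

    def __init__(self, window=90.0, now=time.monotonic):
        self.window = window
        self.now = now                 # injectable clock for testing
        self.failing_since = None

    def on_success(self):
        self.failing_since = None      # any success resets the streak

    def on_failure(self):
        # Returns True when the consecutive-failure streak is long
        # enough that the caller should log an error.
        if self.failing_since is None:
            self.failing_since = self.now()
        return self.now() - self.failing_since >= self.window
```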
This helper function is an artifact of forward-porting service levels,
and it wouldn't even compile when used because of mismatched
function declarations. It's not used anywhere in the open-source code,
so it's removed to avoid future merge conflicts.
Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>
Since serialize_value needs to copy the values to a bigger buffer anyway,
there is no point in copying the argument higher in the call chain.
This patch eliminates some pointless copies, for example in
alternator/executor.cc
Closes #8688
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.
For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.
As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.
In this patch, a new failure detector is implemented.
- Direct detection
The existing failure detector can get gossip heartbeat updates
indirectly. For example:
Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues
Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.
This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.
This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been met.
Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.
- Parallel detection
The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16-node cluster, each node has 16 shards, each shard will handle
only one peer node.
A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.
- Deterministic detection
Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo messages before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.
- Compatible
This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.
Fixes #8488
Closes #8036
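The direct, deterministic, parallel detection described above can be sketched compactly: a peer is down when no echo has succeeded within the configured timeout, and peers are partitioned across shards. Class and function names are illustrative, not Scylla's:

```python
class DirectFailureDetector:
    """Sketch: a peer is marked down when no echo reply has been seen
    within failure_detector_timeout_in_ms -- no indirect, gossiped
    heartbeats are consulted, so delayed gossip cannot keep a dead
    node looking alive."""

    def __init__(self, timeout_ms, now_ms):
        self.timeout_ms = timeout_ms
        self.now_ms = now_ms            # injectable millisecond clock
        self.last_ok_ms = {}

    def on_echo_reply(self, node):
        self.last_ok_ms[node] = self.now_ms()

    def is_down(self, node):
        last = self.last_ok_ms.get(node)
        return last is None or self.now_ms() - last > self.timeout_ms

def owner_shard(node_index, smp_count):
    # Parallel detection: each shard monitors a subset of the peers,
    # e.g. 32 nodes on 16 shards means 2 peers per shard.
    return node_index % smp_count
```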
The -Wunused-private-field was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.
It found a serious bug (#8682) and a few minor instances of waste.
Closes #8683
* github.com:scylladb/scylla:
build: enable -Wunused-private-field warning
test: drop unused fields
table: drop unused field database_sstable_write_monitor::_compaction_manager
streaming: drop unused fields
sstables: mx reader: drop unused _column_value_length field
sstables: index_consumer: drop unused max_quantity field
compaction: resharding_compaction: drop unused _shard field
compaction: compaction_read_monitor: drop unused _compaction_manager field
raft: raft_services: drop unused _gossiper field
repair: drop unused _nr_peer_nodes field
redis: drop unused fields _storage_proxy and _requests_blocked_memory
mutation_rebuilder: drop unused field _remaining_limit
db: data_listeners: remove unused field _db
cql3: insert_json_statement: note bug with unused _if_not_exists
cql3: authorized_prepared_statement_cache: drop unused field _logger
auth: service_level_resource_view: drop unused field _resource
Yet another patch preventing potentially large allocations.
Currently, collection_mutation{_view,}_description linearize each collection
value during deserialization. It's not unthinkable that a user adds a
large element to a list or a map, so let's avoid that.
This patch removes the dependency on linearizing_input_stream, which does not
provide a way to read fragmented subbuffers, and replaces it with a new
helper, which does. (Extending linearizing_input_stream is not viable without
rewriting it completely).
Only linearization of collection values is corrected in this patch.
Collection keys are still linearized. Storing them in managed_bytes is likely
to be more harmful than helpful, because large map keys are extremely unlikely,
and UUIDs, which are used as keys in lists, do not fit into managed_bytes's
small value optimization, so this would incur an extra allocation for every
list element.
Note: this patch leaves utils/linearizing_input_stream.hh unused.
Refs: #8120
Closes #8690
Yet another patch aiming to prevent potentially large allocations.
abstract_type::hash somehow evaded the anti-linearization patches until now.
Fix that.
Note that decimals and varints are still linearized, but we leave it be,
under the assumption that nobody inserts 128KiB-large varints into a database.
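The anti-linearization idea can be sketched like this (illustrative Python using md5; the actual type-hashing code and hash function differ): feed each fragment of a fragmented value into the hash incrementally instead of first concatenating (linearizing) them into one contiguous buffer.

```python
import hashlib

def hash_fragmented(fragments):
    """Hash a value stored as a sequence of byte fragments without
    ever materializing the whole value contiguously."""
    h = hashlib.md5()
    for frag in fragments:
        h.update(frag)  # incremental update, no concatenation
    return h.hexdigest()
```

The result is identical to hashing the linearized value, so callers cannot tell the difference, but no large contiguous allocation is made.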
Refs: #8120
Closes #8689
There are places where abstract_type::deserialize is called just to pass the
result to compound_wrapper::from_singular, which immediately serializes it
again.
Get rid of this ritual by adding a version of from_singular which takes
a serialized argument.
As a bonus, along the way we eliminate some pointless copies of lw_shared_ptr
and std::shared_ptr caused by two careless uses of `auto`.
Closes #8687
As an optimization for optimistic cases, reclaim_from_evictable first evicts
the requested amount of memory before attempting to reclaim segments through
compactions. However, due to an oversight, it does this before every compaction
instead of once before all compactions.
Usually reclaim_from_evictable is called with small targets, or is preemptible,
and in those cases this issue is not visible. However, when the target is bigger
than one segment and the reclaim is not preemptible, which is the case when it's
called from allocating_section, this results in a quadratic explosion of
evictions, which can evict several hundred MiB to reclaim a few MiB.
Fix that by calculating the target of memory eviction only once, instead of
recalculating it after every compaction.
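The difference can be modeled with a toy calculation (illustrative Python only; the names and memory model are made up, not the actual LSA code):

```python
def reclaim_buggy(target, free, evict_step, compactions):
    """Toy model of the old flow: the eviction loop runs before *every*
    compaction, so the eviction work is repeated once per compaction."""
    evictions = 0
    for _ in range(compactions):
        freed = free
        while freed < target:
            freed += evict_step
            evictions += 1
    return evictions

def reclaim_fixed(target, free, evict_step, compactions):
    """Fixed flow: evict down to the target once, then run all compactions."""
    evictions = 0
    freed = free
    while freed < target:
        freed += evict_step
        evictions += 1
    return evictions
```

With a target of 10 units and 5 compactions, the buggy flow performs 5x the evictions of the fixed one, and the factor grows with the number of compactions.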
Fixes #8542
Closes #8611
The -Wunused-private-field was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.
It found a serious bug (#8682) and a few minor instances of waste.
The parser accepts INSERT JSON ... IF NOT EXISTS but we later
ignore it. This is a bug (#8682). Note it down and silence
the compiler error that will result when we enable
-Wunused-private-field.
scripts/coverage.py now has a detailed help, don't repeat its content in
HACKING.md, instead redirect users who want to learn more to the
script's help.
Through `coverage.py`. This saves the user from all the env setup
required for `coverage.py` to successfully generate the report
afterwards. Instead all of this is taken care of automatically, by just
running:
./scripts/coverage.py --run ./build/coverage/.../mytest arg1 ... argN
`coverage.py` takes care of running the test and generating a coverage
report from it.
As a side effect, also fix `main()` ignoring its `argv` parameter.
Currently `coverage.py` includes all raw profiling data found at PATH
automatically. This patch gives an option to override this, instead
including only the given input files in the report.
This patch adds an Alternator test, test_batch_get_item_large,
which checks a BatchGetItem with a moderately large (1.5 MB) response.
The test passes - we do not have a bug in BatchGetItem - but it
does reproduce issue #8522 - the long response is stored in memory as
one long contiguous string and causes a warning about an over-sized
allocation:
WARN ... seastar_memory - oversized allocation: 2281472 bytes.
Incidentally, this test also reproduces a second contiguous
allocation problem - issue #8183 (in BatchWriteItem which we use
in this test to set up the item to read).
Refs #8522
Refs #8183
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210520161619.110941-1-nyh@scylladb.com>
_rows uses a deque, but doesn't need any special functionality.
Switch to chunked_vector, which uses one less allocation in the
common case (std::deque has an extra allocation for managing its
chunks).
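The structure can be sketched as a toy chunked vector (illustrative Python; utils::chunked_vector itself is a C++ template with different internals):

```python
class ChunkedVector:
    """Toy chunked vector: elements live in fixed-size chunks, so growth
    never moves existing elements and the size of any single contiguous
    allocation stays bounded by the chunk size."""

    def __init__(self, chunk_size=128):
        self.chunk_size = chunk_size
        self.chunks = []
        self.size = 0

    def push_back(self, x):
        if self.size % self.chunk_size == 0:
            self.chunks.append([])  # open a new chunk when the last is full
        self.chunks[-1].append(x)
        self.size += 1

    def __getitem__(self, i):
        return self.chunks[i // self.chunk_size][i % self.chunk_size]
```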
Closes #8679
The gdb self-tests fail on aarch64 due to a failure to use thread-local
variables. I filed [1] so it can get fixed.
Meanwhile, disable the test so the build passes. It is sad, but the aarch64
build is not impacted by these failures.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=27886
Closes #8672
Currently, the new NODE_OPS_CMD for replace operation is used only when
repair based node operation is enabled. However, we can use the
NODE_OPS_CMD to run replace operation and use streaming instead of
repair to sync data as well.
After this patch, we will use streaming inside run_replace_ops if repair
based node ops is not enabled, so that we can get the benefits that
NODE_OPS_CMD brings in commit 323f72e48a
(repair: Switch to use NODE_OPS_CMD for replace operation).
Fixes #8013
* seastar 847fccaf5e...28dddd2683 (13):
> reactor: disable xfs extent size hints if using the kernel page cache
> smp: replace _reactors global with a local
> Merge "Add test for IO-scheduler (fails now)" from Pavel E
> weak_ptr: lift restriction on copying
> core: expose hidden method from parent class
> perftune.py: __get_feature_file(): verify that parameters are not None
> gate: assert no outstanding requests when destroyed
> httpd: add status_types
> cmake: use -O2 for CMAKE_CXX_FLAGS_DEV with clang
> compat: source_location: use std::source_location only if available
> iotune: disambiguate "this" lambda capture in C++20 mode
> Merge "Consider disk saturation request lengths" from Pavel E
> Merge 'seastar-addr2line: support oneline backtrace in resolve call' from Benny Halevy
_maximum_request_size is renamed to _max_bytes_count
in 40a29d5590
This patch adds support for ioq io_group._max_bytes_count
if io_group._maximum_request_size isn't found.
Test: scylla-gdb(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210520151537.710123-1-bhalevy@scylladb.com>
"
There is a lot of global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, all those services had become main-local
recently. Along the way a call to global storage proxy is dropped.
To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.
Overall the change is
- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
constructions to give them the dependencies
- use needed services in _info and _meta methods
tests: unit(dev), dtest.repair(dev)
"
* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
repair: Drop most of globals from repair
repair: Use local references in messaging handler checks
repair: Use local references in create_writer()
repair: Construct repair_meta with local references
repair: Keep more stuff on repair_info
repair: Kill bunch of global usages from insert_repair_meta
repair: Pass repair service down to meta insertion
repair: Keep local migration manager on repair_info
repair: Move unused db captures
repair: Remove unused ms captures
repair: Construct repair_info with service
repair: Loop over repair sharded container
repair: Make sync_data_using_repair a method
repair: Use repair from storage service
repair: Keep repair on storage service
repair: Make do_repair_start a method
repair: Pass repair_service through the API until do_repair_start
repair: Fix indentation after previous patch
repair: Split sync_data_using_repair
repair: Turn repair_range a repair_info method
...
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
of failing sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
that there are no more problems related to wrapping the original
rapidjson allocator.
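The special path can be sketched like so (illustrative Python; the real wrapper is C++ around the rapidjson allocator, and the limit value here is made up):

```python
class BoundedAllocator:
    """Sketch of the idea: instead of skipping the test under sanitizers,
    the wrapping allocator itself refuses sufficiently large requests,
    so the sanitizers can still exercise the rest of the wrapper."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes

    def allocate(self, size):
        if size > self.limit:
            # Fail the request instead of forwarding it to the real
            # allocator, mimicking an allocation-failure path.
            raise MemoryError(f"allocation of {size} bytes exceeds limit")
        return bytearray(size)
```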
Closes #8539
Now RPC module has some basic testing coverage to
make sure RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).
The test suite currently consists of the following
test-cases:
* Loading server instance with configuration from a snapshot.
* Loading server instance with configuration from a log.
* Configuration changes (remove + add node).
* Leader elections don't lead to RPC configuration changes.
* Voter <-> learner node transitions also don't change RPC
configuration.
* Reverting uncommitted configuration changes updates
RPC configuration accordingly (two cases: revert to
snapshot config or committed state from the log).
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
It would be good to have a `CHECK` variant in addition
to an existing `REQUIRE_EVENTUALLY_EQUAL` macro. Will be used
in raft RPC tests.
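The idea behind the macro pair can be sketched in Python (names and polling parameters are illustrative, not the actual test macros):

```python
import time

def eventually_equal(fn, expected, timeout_s=5.0, poll_s=0.01):
    """Repeatedly poll until fn() == expected or the timeout elapses.
    A REQUIRE variant would raise on failure; the CHECK variant just
    reports the outcome and lets the test continue."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fn() == expected:
            return True
        time.sleep(poll_s)
    return fn() == expected  # one last check at the deadline
```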
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Currently, emulated rpc network is shared between all test cases
in `replication_test.cc` (see static `rpc::net` map).
However, its value is not reset when executing a subsequent test
case, which opens a possibility for heap-use-after-free bugs.
Also, make all `send_*` functions in the test rpc class throw an
error if the node being contacted is not in the network, instead of
performing a past-the-end access. This makes it safe to contact a
non-existent node, which will be used in RPC tests later.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
assure_sufficient_live_nodes() is a huge template calling other
huge templates, and requires "network_topology_strategy.hh". It is
inlined in consistency_level.hh. This increases compile time and causes needless recompiles.
Move the template out-of-line and use "extern template" to instantiate it.
This is not ideal as new callers would require updates to the
instantiated signatures, but I think our goal should be to de-template
it completely instead. Meanwhile, this reduces some pain.
Ref #1.
Closes #8637
The first patch adds a nodetool-like capability to the cql-pytest framework.
It is *not* meant to be used to test nodetool itself, but rather to give CQL
tests the ability to use nodetool operations - currently only one operation -
"nodetool flush".
We try to use Scylla's REST API, if possible, and only fall back to using an
external "nodetool" command when the REST API is not available - i.e., when
testing Cassandra. The benefit of using the REST API is that we don't need
to run the jmx server to test Scylla.
The second patch is an example of using the new nodetool flush feature
in a test that needs to flush data to reproduce a bug (which has already
been fixed).
Closes #8622
* github.com:scylladb/scylla:
cql-pytest: reproducer for issue #8138
cql-pytest: add nodetool flush feature
We add a reproducing test for issue #8138, where if we write to a
TWCS table, scanning it would yield no rows - and worse - crash the
debug build.
This test requires "nodetool flush" to force the read to happen from
sstables, hence the nodetool feature was implemented in the previous
patch (on Scylla, it uses the REST API - not actually running nodetool
or requiring JMX).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a nodetool-compatible capability to the cql-pytest
framework. It is *not* meant to be used to test nodetool itself, but
rather to give CQL tests the ability to use nodetool operations -
currently one operation - "nodetool flush".
Use it in a test as:
import nodetool
nodetool.flush(cql, table)
I chose a functional API with parameters ("cql") instead of a fixture
with an implied connection so that in the future we may allow multiple
nodes and this API will allow sending nodetool requests to
different nodes. However, multi-node support is not implemented yet,
nor used in any of the existing tests.
The implementation uses Scylla's REST API if available, or if not, falls
back to using an external "nodetool" command (which can be overridden
using the NODETOOL environment variable). This way, both cql-pytest/run
(Scylla) and cql-pytest/run-cassandra (Cassandra) now correctly support
these nodetool operations, and we still don't need to run JMX to test
Scylla.
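The fallback logic can be sketched roughly like this (illustrative Python, not the actual cql-pytest code; the REST endpoint path and port are assumptions based on Scylla's API):

```python
import os
import subprocess
import urllib.request

def flush(host, keyspace, table, api_port=10000):
    """Prefer Scylla's REST API; fall back to an external nodetool binary
    (overridable via the NODETOOL environment variable) when the API is
    not reachable, e.g. when running against Cassandra."""
    url = (f"http://{host}:{api_port}/storage_service/"
           f"keyspace_flush/{keyspace}?cf={table}")
    try:
        with urllib.request.urlopen(urllib.request.Request(url, method="POST")):
            return "rest"
    except OSError:
        # REST API unavailable: run the external nodetool command instead.
        nodetool = os.environ.get("NODETOOL", "nodetool")
        subprocess.run([nodetool, "-h", host, "flush", keyspace, table],
                       check=True)
        return "nodetool"
```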
The reason we want to support nodetool.flush() is to reproduce bugs that
depend on data reaching disk. We already had such a reproducer in
test_large_cells_rows.py - it too did something similar - but it was
Scylla-only (using only the REST API). Instead of copying such code to
multiple places, it is better to have a common nodetool.flush() function, as
done in this patch. The test in test_large_cells_rows.py can later be
changed to use the new function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
This set does two things:
- hides migrate-fn machinery in allocation_strategy header
- conceptualizes dynamic objects
The former is possible after IMR rework -- nowadays users of LSA don't
need to do anything special with "migrators" so they can be turned to
be internal allocation-strategy helpers.
The latter is to make sure dynamic objects do not forget to overload
the size_for_allocation_strategy(). If this happens the whole thing
compiles fine and sometimes works, but generates memory corruptions, so
it's worth adding more confidence here.
tests: unit(dev)
"
* 'br-lsa-hide-migrators' of https://github.com/xemul/scylla:
bptree: Require dynamic object for nodes reconstruct
allocation_strategy, code: Conceptualize dynamic objects
allocation_strategy: Hide migrators
allocation_strategy, code: Simplify alloc()
allocation_strategy: Mark size_for_allocation_strategy noexcept
The B+ tree is not intrusive and supports both kinds of objects --
dynamic (in sense of previous patch) and fixed-size. Respectively,
the nodes provide .storage_size() method and get the embedded object
storage size themselves. Thus, if a dynamic object is used with the
tree but it misses the .storage_size() itself this would come
unnoticed. Fortunately, dynamic objects use the .reconstruct()
method, so the method should be equipped with the DynamicObject
concept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Usually lsa allocation is performed with the construct() helper that
allocates a sizeof(T) slot and constructs it in place. Some rare
objects have dynamic size, so they are created by alloc()ating a
slot of some specific size and (!) must provide the correct overload
of size_for_allocation_strategy that reports back the relevant
storage size.
This "must provide" is not enforced, if missed a default sizer would
be instantiated, but won't work properly. This patch makes all users
of alloc() conform to DynamicObject concept which requires the
presense of .storage_size() method to tell that size.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After IMR rework the only lsa-migrating functionality is standard one
that calls move constructors on lsa slots. Hide the whole thing inside
allocation-strategy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today's alloc() accepts migrate-fn, size and alignment. All the callers
don't really need to provide anything special for the migrate-fn and
are just happy with default alignof() for alignment. The simplification
is in providing alloc() that only accepts size arg and does the rest
itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore `stats_collector` references is the one obtained from the
database object, which is already stopped by `database::stop()`, making
the stop in `~stats_collector()` redundant, and even worse, as it
triggers an assert failure. Remove it.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518140913.276368-1-bdenes@scylladb.com>
A minimal implementation of alternator test env, a younger cousin
of cql_test_env, is implemented. Note that using this environment
for unit tests is strongly discouraged in favor of the official
test/alternator pytest suite. Still, alternator_test_env has its uses
for microbenchmarks.
Current uninstall.sh tries to mirror install.sh's logic, but that
needlessly bloats the script, and it still fails to remove a few files
under /opt/scylladb.
Let's just do rm -rf /opt/scylladb, and drop the few other files located
outside of /opt/scylladb.
Closes #8662
Consider the following procedure:
- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to remove n3 from the
cluster
- n1 is down in the middle of removenode operation
Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues. We need a way to abort
such streaming attempts.
To abort the removenode operation and forcibly remove the node, users
can run `nodetool removenode force` on any existing nodes to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.
In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.
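The retry loop with the new status check can be sketched as follows (illustrative Python; all names are made up, and only the 5-attempt / 60-second / 1.5x parameters come from the text above):

```python
def restore_replica_count(stream_once, node_status, max_retries=5,
                          first_sleep_s=60, backoff=1.5,
                          sleep=lambda s: None):
    """Before each (re)attempt, bail out if the node being removed has
    already moved to 'removed' gossip status."""
    sleep_s = first_sleep_s
    for _ in range(max_retries):
        if node_status() == "removed":
            return "aborted"          # the new status check
        if stream_once():
            return "done"
        sleep(sleep_s)                # retry after backoff: 60s, 90s, 135s, ...
        sleep_s *= backoff
    return "failed"
```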
This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.
Fixes #8651
Closes #8655
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
until the test executable is running.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
"
The current scrub compaction has a serious drawback: while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing such corruptions: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to lose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view, they
can only do so on the database-level. So in certain cases it might be
desirable to have a less heavy-handed approach to cleansing the data,
one that tries as hard as it can not to lose any data.
This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to lose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
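On the partition level, the segregation idea can be sketched like this (illustrative Python; real scrub works on mutation fragments and sstable writers, and strictly increasing keys per output are assumed):

```python
def segregate(partition_keys):
    """Place each incoming partition key into the first output whose last
    key is smaller, opening a new output for out-of-order keys, so every
    output is a valid (sorted) stream and no data is dropped."""
    outputs = []
    for key in partition_keys:
        for out in outputs:
            if out[-1] < key:
                out.append(key)
                break
        else:
            outputs.append([key])  # out of order everywhere: new output
    return outputs
```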
The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost, every scrub that does more than just discard data
that is suspected to be corrupt (and even those do, to a certain degree)
has to consider the possibility that it is rehabilitating corruptions,
leaving them in the system without a warning, in the sense that the
user won't see any more problems due to low-level corruptions and
hence might think everything is alright, while data is still corrupt
from the business-logic point of view. It is very hard to draw a line
between what scrub should and shouldn't do, yet there is a demand from
users for a scrub that can restore data without losing any of it. Note
that anybody executing such a scrub is already in a bad shape; even if
they can read their data (they often can't), it is already corrupt, so
scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
enum, which now selects the scrub mode. This means that
`skip_corrupted` cannot be combined with segregate to throw out what
the former can't fix. This was chosen for simplicity: a bunch of
flags all interacting with each other is very hard to see through in
my opinion, while a linear mode selector is much easier to follow.
* The new segregate mode goes all-in, by trying to fix even
fragment-level disorder. Maybe it should only do it on the partition
level, or maybe this should be made configurable, allowing the user to
select what to happen with those data that cannot be fixed.
Tests: unit(dev), unit(sstable_datafile_test:debug)
"
* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
test: boost/sstable_datafile_test: add tests for segregate mode scrub
api: storage_service/keyspace_scrub: expose new segregate mode
sstables: compaction/scrub: add segregate mode
mutation_fragment_stream_validator: add reset methods
mutation_writer: add segregate_by_partition
api: /storage_service/keyspace_scrub: add scrub mode param
sstables: compaction/scrub: replace skip_corrupted with mode enum
sstables: compaction/scrub: prevent infinite loop when last partition end is missing
tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
Otherwise an interleaving cache update can clear the `_prev_snapshot`
before the reader is created, leading to the reader being created via a
null mutation source.
Tests: unit(dev, release, debug:row_cache_test)
Fixes #8671.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518092317.227433-1-bdenes@scylladb.com>
Currently, unified installer does not apply correct file security context
while copying files, it causes permission error on scylla-server.service.
We should apply default file security context while copying files, using
'-Z' option on /usr/bin/install.
Also, because install -Z requires normalized path to apply correct security
context, use 'realpath -m <PATH>' on path variables on the script.
Fixes #8589
Closes #8602
Since we have added scylla-node-exporter, we needed to do 'install -d'
for the systemd and sysconfig directories before copying files.
Fixes #8663
Closes #8664
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
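The relaxed comparison can be sketched as follows (illustrative Python with made-up string type names; the real check compares abstract_type objects):

```python
def unreversed(t):
    """Strip a hypothetical 'reversed' wrapper, modeling a clustering
    column with DESC order whose type is wrapped as reversed<T>."""
    return t[len("reversed<"):-1] if t.startswith("reversed<") else t

def types_compatible(base_type, view_type):
    """Relaxed check: compare the underlying types, so a base column with
    DESC clustering order ('reversed<int>') still matches the view
    column type 'int' -- both deserialize the same way."""
    return unreversed(base_type) == unreversed(view_type)
```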
Tests: unit(release), along with the additional test case
introduced in this series; the test also passes
on Cassandra
Fixes #8666
Closes #8667
* github.com:scylladb/scylla:
test: add a test case for paging with desc clustering order
cql3: relax a type check for index paging
While having a large list in the IN clause is unlikely, it's still an
arbitrarily large piece of user-provided data.
On principle, let's use a chunked container here to prevent large contiguous
allocations.
Issue #8666 revealed an issue with validating types for paged
indexed queries - namely, the type checking mechanism is too strict
in comparing types and fails on mismatched clustering order -
e.g. an `int` column type is different from `int` with DESC
clustering order. As a result, users see a *very* confusing
message (because reversed types are printed as their underlying type):
> Mismatched types for base and view columns c: int and int
This test case fails before the fix for #8666 and thus acts
as a regression test.
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager
- namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
Tests: unit(release), along with the additional test case
introduced in this series; the test also passes
on Cassandra
Fixes #8666
Returning by reference requires that the elements are internally stored in
the multi_item_terminal as a std::vector, but in the next patch we will change
the internal type of lists::value from std::vector to utils::chunked_vector.
The copy is not a problem because all users of multi_item_terminal were copying
the returned vector.
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.
Fixes #8657.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
Extract raft tickers initialization into `init_raft_tickers()`
function. This will be later used in rpc tests.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This will be used later in rpc tests to support passing a dummy
apply function, which does not need to update state machine at all.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Use a somewhat neater structure to represent `create_raft_server()`
return value instead of cumbersome
`std::pair<std::unique_ptr<raft::server>, state_machine*>`.
Not only is `test_server` much shorter, but it is also much more
descriptive.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Include the version from `helpers.hh`. This also makes it possible
to use additional utilities from this header file, like `id()`
and `address_set()`, which come in handy in simple tests and
will be used in rpc testing.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
It is generally considered a bad practice to use the `using`
directives at global scope in header files.
Also, many parts of `test/raft/helpers.hh` were already
using `raft::` prefixes explicitly, so definitely not much
to lose there.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Use a hardware counter to report instructions per fragment. Results
vary from ~4k insns/f when reading sequentially to more than 1M insns/f.
Instructions per fragment can be a more stable metric than frags/sec.
It would probably be even more stable with a fake file implementation
that works in-memory to eliminate seastar polling instruction variation.
Closes #8660
This is the 1st PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so current solution is for small tables only. As an example system.status was added. It was checked that DISTINCT and reverse ORDER BY do work.
This PR was created by @jul-stas and @StarostaGit
Fixes #8343
This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader)
Closes #8634
* github.com:scylladb/scylla:
boost/tests: Add virtual_table_test for basic infrastructure
boost/tests: Test memtable_filling_virtual_table as mutation_source
db/system_keyspace: Add system.status virtual table
db/virtual_table: Add a way to specify a range of partitions for virtual table queries.
db/virtual_table: Introduce memtable_filling_virtual_table
db: Add virtual tables interface
db: Introduce chained_delegating_reader
Following alternator unit tests, cql-pytest now also boots
Scylla/Cassandra with authentication enabled.
Unconditionally enabling authentication does not ruin any existing
test case, while it enables testing more scenarios. For instance,
Scylla-specific service levels can only be created and attached
to roles, which depends on authentication being enabled.
A sad side-effect is that Scylla boots slower with PasswordAuthenticator
than without it - it takes 15 seconds to set up the default
superuser account due to a hardcoded sleep duration [1] :( That should be
solved by a separate fix though.
[1]:
auth/common.hh:
inline future<> delay_until_system_ready(seastar::abort_source& as) {
return sleep_abortable(15s, as);
}
This flag is not really needed, because we can just attempt a resume on
first use which will fail with the default constructed inactive read
handle and the reader will be created via the recreate-after-evicted
path.
This allows the same path to be used for all reader creation cases,
simplifying the logic and more importantly making further patching
easier without the special case.
To make the recreate path (almost) as cheap for the first reader
creation as it was with the special path, `_trim_range_tombstones` and
`_validate_partition_key` are only set when really needed.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>
We now have close() which is expected to clean up, so there is no need for
cleanup in the destructor, and consequently no need for a destructor at all.
Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>
Enabling it for each run_worker call will invoke ioctl
PERF_EVENT_IOC_ENABLE in parallel to other workers running
and this may skew the results.
Test: perf_simple_query
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>
No code left that uses these globals, so rip them out altogether. Also
drop the former messaging init/uninit methods that are now only
setting up those globals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some time ago checks for sys-dist-ks and view-update-generator to
be locally initialized were moved inside the repair service message
handlers. Now everything is ready to use service's reference instead
of global pointers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The repair_writer::create_writer() method needs sys-dist-ks
and view-update-generator. It's only called from repair_meta
which already has both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The repair_meta needs sys-dist-ks and view-update-generator. Now,
when it's created, both are available: once from the repair service
and once from the repair_info.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In one of the places, the repair_meta is created from the repair_info.
It will need the sys-dist-ks and view-update-generator, so put them into
the info to have them at hand at meta creation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The insert_repair_meta needs to peek at the global proxy to get the db from,
at the migration_manager to call get_schema_for_write(), and at the global
messaging to pass it as an argument to the mentioned call and to construct
the repair_meta.
All three can be obtained from the repair service, which is now passed
there as an argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The repair_meta will need to get local references to sys_dist_ks
and view_update_generator. One of the places where it's created
is insert_repair_meta that's called (almost) directly from the
repair messaging handler which already has the repair service.
One thing to take care of is that the handler reshards on entry,
so do the container().invoke_on() and get the local repair from
the lambda's argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The repair_range routine needs to mess with migration manager.
Fortunately the routine had been patched to be repair_info's
method and the repair_info itself can get the migration manager
from repair_service.
ATTN -- the obtained reference is local, not sharded<>, but the
repair_info doesn't reshard and can carry a local reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the repair service is passed around repair_info
creation a lot of local messaging captures become unused and
can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The repair_info object will need to carry more services
references on board. This now can be easily achieved by passing
the repair service into the repair_info constructor. The info
can then get all it needs from the service. This patch is the
step #1 here -- replace db and messaging args with repair
service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patches made sync_data_using_repair and do_repair_start
methods of repair service. This was done to have local repair
reference near the creation of repair_info. The last step left
is to make the local repair service available inside the .invoke_on
lambda. This patch makes this invoke_on on the repair service
itself thus automagically getting the local repair service in lambda.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_..._with_repair()-s all call sync_data_using_repair, the latter
was previously prepared to receive local repair service reference via
"this", so it's finally time to make it happen.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the continuation of the previous patch -- the do_..._with_repair
functions become repair_service methods and will get local repair
service reference as "this".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service calls a bunch of do_something_with_repair() methods. All
of them need the local repair_service and the only way to get it is by
keeping it on storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_repair_start() creates repair_info which will need the
repair_service reference, so turn this function into a method
to have repair_service as "this".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The routine in question creates repair_info inside. The repair_info
will need to receive local repair_service reference somehow, this
split prepares ground for this change.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This routine uses the global migration_manager pointer. Next patches
will keep a reference to the manager on repair_info, making it
possible to use the this->migration_manager reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sys-dist-ks and view-update-generator checks use global pointers
and are performed inside a static repair_meta method. At the same
time, the caller of this method is the repair_service class, which
already has both on board, so move the checks up. Later they will use
the service-local references, not the global pointers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now repair messaging handlers are set-up on all shards by
doing messaging.invoke_on_all() calls. Since the repair service is now
sharded and its .start() and .stop() are invoke-on-all-ed, it's
better to move messaging init/deinit into them.
The indentation is deliberately left broken until the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The repair service needs database, migration manager, messaging and
sys-dist-ks + view-update-generator pair. Put all these guys on it
in advance and equip the service with getters for future use.
Some dependencies are sharded because they are used in cross-shard
context and need more care to be cleaned out later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the repair service exists in a single instance, but it's becoming
a sharded<> service. The tracker expects to be constructed once, so make
it a pointer; the next patch will instantiate it on shard 0 only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We introduce `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could
come up with. Represented by a C++ concept, it consists of: a set of
inputs (represented by the `input_t` type), outputs (`output_t` type),
states (`state_t`), an initial state (`init`) and a transition
function (`delta`) which given a state and an input returns a new
state and an output.
The rest of the testing infrastructure is going to be generic
w.r.t. `PureStateMachine`. This will allow easily implementing tests
using both simple and complex state machines by substituting the
proper definition for this concept.
Next comes `logical_timer`: it is a wrapper around
`raft::logical_clock` that allows scheduling events to happen after a
certain number of logical clock ticks. For example,
`logical_timer::sleep(20_t)` returns a future that resolves after 20
calls to `logical_timer::tick()`. It will be used to introduce
timeouts in the tests, among other things.
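A minimal callback-based sketch of the idea (the real `logical_timer` wraps `raft::logical_clock` and returns Seastar futures; this rendering with plain callbacks is illustrative only):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Sketch: events are scheduled to fire after a given number of tick() calls.
class logical_timer {
    uint64_t _now = 0;
    std::multimap<uint64_t, std::function<void()>> _events;
public:
    uint64_t now() const { return _now; }
    // Schedule `f` to run once `ticks` more ticks have elapsed.
    void sleep(uint64_t ticks, std::function<void()> f) {
        _events.emplace(_now + ticks, std::move(f));
    }
    void tick() {
        ++_now;
        auto end = _events.upper_bound(_now);
        for (auto it = _events.begin(); it != end; ++it) {
            it->second();
        }
        _events.erase(_events.begin(), end);
    }
};
```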
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.
`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.
The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `persistence`.
Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:
1. Before sending a command to a Raft leader through `server::add_entry`,
one must first directly contact the instance of `impure_state_machine`
replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
(represented by a promise-future pair) and a unique ID; it stores the
input side of the channel (the promise) with this ID internally and returns
the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
and sends it as a command to Raft. Thus commands are (ID, machine input)
pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
with the given ID. If it finds one, it sends the output through this
channel.
5. The command sender waits for the output on the obtained future.
The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.
A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
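The channel bookkeeping in steps 1-5 can be sketched with `std::promise`/`std::future` (the real implementation uses Seastar futures; the class and method names here are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <unordered_map>
#include <utility>

// Sketch of the output-channel technique: IDs are allocated before a
// command is submitted; on apply, the output is pushed through the
// matching promise, if this server holds it.
template <typename Output>
class output_channels {
    uint64_t _next_id = 0;
    std::unordered_map<uint64_t, std::promise<Output>> _channels;
public:
    // Steps 1-2: allocate a channel, return its ID and the read side.
    std::pair<uint64_t, std::future<Output>> allocate() {
        auto id = _next_id++;
        std::promise<Output> p;
        auto f = p.get_future();
        _channels.emplace(id, std::move(p));
        return {id, std::move(f)};
    }
    // Step 4: deliver the output if we hold the channel. Only the server
    // that allocated the ID will find it; others silently ignore it.
    void deliver(uint64_t id, Output out) {
        if (auto it = _channels.find(id); it != _channels.end()) {
            it->second.set_value(std::move(out));
            _channels.erase(it);
        }
    }
};
```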
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.
The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.
The actual message passing is implemented with `network` and `delivery_queue`.
The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receiving function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.
`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
`persistence` represents the data that does not get lost between server
crashes and restarts.
We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.
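The contiguity invariant can be sketched as follows (illustrative types and names, not the actual test code):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch of the ``contiguous log'' invariant: each stored entry's index
// equals the previous entry's index plus one.
struct entry { uint64_t idx; int command; };

class mem_persistence {
    std::vector<entry> _stored_entries;
public:
    void store_entry(entry e) {
        if (!_stored_entries.empty() && e.idx != _stored_entries.back().idx + 1) {
            throw std::runtime_error("log entries must be contiguous");
        }
        _stored_entries.push_back(e);
    }
    size_t size() const { return _stored_entries.size(); }
};
```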
Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc`. We ensure that the stored log either ``touches''
the stored snapshot on the right side or intersects it.
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.
Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.
`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
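The conviction rule can be sketched as follows (the threshold, types and names are illustrative; the real detector sends heartbeats via `send_heartbeat_t`):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Sketch of heartbeat-based conviction: a server is convicted as failed
// if no heartbeat arrived within `_threshold` logical ticks.
class failure_detector {
    uint64_t _now = 0;
    uint64_t _threshold;
    std::unordered_map<int, uint64_t> _last_heartbeat;
public:
    explicit failure_detector(uint64_t threshold) : _threshold(threshold) {}
    void add_server(int id) { _last_heartbeat[id] = _now; }
    void remove_server(int id) { _last_heartbeat.erase(id); }
    void receive_heartbeat(int id) { _last_heartbeat[id] = _now; }
    void tick() { ++_now; }
    bool is_alive(int id) const {
        auto it = _last_heartbeat.find(id);
        return it != _last_heartbeat.end() && _now - it->second <= _threshold;
    }
};
```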
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.
The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
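A sketch of such an event queue, with plain integers standing in for the payload and logical time (the real `network` is generic over the payload type; all names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Sketch: messages carry a delivery time on a logical clock; tick()
// advances the clock and delivers every message whose time has come due.
struct message { uint64_t time; int src; int dst; int payload; };
struct later {
    bool operator()(const message& a, const message& b) const { return a.time > b.time; }
};

class network {
    uint64_t _now = 0;
    std::priority_queue<message, std::vector<message>, later> _events;
    std::function<void(int, int, int)> _deliver;  // deliver_t stand-in
public:
    explicit network(std::function<void(int, int, int)> deliver)
        : _deliver(std::move(deliver)) {}
    void send(int src, int dst, int payload, uint64_t delay) {
        _events.push({_now + delay, src, dst, payload});
    }
    void tick() {
        ++_now;
        while (!_events.empty() && _events.top().time <= _now) {
            auto m = _events.top();
            _events.pop();
            _deliver(m.src, m.dst, m.payload);
        }
    }
};
```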
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.
That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
`environment` represents a set of `raft_server`s connected by a `network`.
The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.
Needs to be periodically `tick()`ed which ticks the network
and underlying servers.
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.
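The limit/abort contract can be sketched as follows. The real ticker runs as a fiber yielding to the Seastar reactor between calls; this synchronous rendering, with made-up names, only illustrates the contract:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>

// Sketch: call `f` repeatedly and fail if the call limit is reached
// before abort() is requested.
class ticker {
    std::function<void()> _f;
    uint64_t _limit;
    bool _aborted = false;
public:
    ticker(std::function<void()> f, uint64_t limit)
        : _f(std::move(f)), _limit(limit) {}
    void abort() { _aborted = true; }
    // Runs the loop to completion; in the real, fiber-based version
    // this happens concurrently with the test body.
    void run() {
        for (uint64_t i = 0; i < _limit; ++i) {
            if (_aborted) return;
            _f();
        }
        if (!_aborted) throw std::runtime_error("tick limit reached before abort()");
    }
};
```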
Finally, we add a simple test that serves as an example of using the
implemented framework. We introduce `ExRegister`, an implementation
of `PureStateMachine` that stores an `int32_t` and handles ``exchange''
and ``read'' inputs; an exchange replaces the state with the given value
and returns the previous state, a read does not modify the state and returns
the current state. In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
* kbr/randomized-nemesis-test-v5:
raft: randomized_nemesis_test: basic test
raft: randomized_nemesis_test: ticker
raft: randomized_nemesis_test: environment
raft: randomized_nemesis_test: server
raft: randomized_nemesis_test: delivery queue
raft: randomized_nemesis_test: network
raft: randomized_nemesis_test: heartbeat-based failure detector
raft: randomized_nemesis_test: memory backed persistence
raft: randomized_nemesis_test: rpc
raft: randomized_nemesis_test: impure_state_machine
raft: randomized_nemesis_test: introduce logical_timer
raft: randomized_nemesis_test: `PureStateMachine` concept
* scylla-dev/raft-cleanup-v1:
raft: drop _leader_progress tracking from the tracker
raft: move current_leader into the follower state
raft: add some precondition checks
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node was changed to postpone
responding to the gossip echo message, to avoid other nodes sending read
requests to the replacing node. It works as follows:
1) the replacing node does not respond to the echo message, so other nodes
do not mark the replacing node as alive
2) the replacing node advertises the hibernate state, so other nodes know
the replacing node is replacing
3) the replacing node responds to the echo message, so other nodes can mark
the replacing node as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimum fix is
implemented in this patch. Once existing nodes learn that the replacing node
is in HIBERNATE state, they register it as replacing, but add it to the
pending list only after the replacing node is marked as alive.
With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.
Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013
Closes #8614
Too many or too resource-hungry reads often lie at the heart of issues
that require an investigation with gdb. Therefore it is very useful to
have a way to summarize all reads found on a shard with their states and
resource consumptions. This is exactly what this new command does. For
this it uses the reader concurrency semaphores and their permits
respectively, which are now arranged in an intrusive list and therefore
are enumerable.
Example output:
(gdb) scylla read-stats
Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1
permits count memory table/description/state
1 1 14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active
16 0 53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active
1 0 1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive
1 0 0 *.*/view_builder/active
1 0 0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active
20 1 14334414 Total
* botond/scylla-gdb.py-scylla-reads/v5:
scylla-gdb.py: introduce scylla read-stats
scylla-gdb.py: add pretty printer for std::string_view
scylla-gdb.py: std_map() add __len__()
scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()
The repair parallelism is calculated from the amount of memory allocated to
repair and the memory usage per repair instance. Currently, it does not
take memory bloat issues into account (e.g., issue #8640), which cause
repair to use more memory and hit std::bad_alloc.
Be more conservative when calculating the parallelism to avoid repair
using too much memory.
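As a rough illustration of the shape of such a calculation (the constants, names and formula here are hypothetical, not the actual Scylla code): the number of parallel repair instances is bounded by the memory budget divided by the per-instance estimate, inflated by a bloat factor to stay conservative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative arithmetic only: divide the repair memory budget by a
// bloat-inflated per-instance estimate; always allow at least one repair.
inline uint64_t repair_parallelism(uint64_t memory_budget, uint64_t per_instance_estimate) {
    constexpr uint64_t bloat_factor = 2;  // assumed safety margin
    return std::max<uint64_t>(1, memory_budget / (per_instance_estimate * bloat_factor));
}
```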
Fixes #8641
Closes #8652
The auth initialization path contains a fixed 15s delay,
which used to work around a couple of issues (#3320, #3850),
but is right now quite useless, because a retry mechanism
is already in place anyway.
This patch speeds up the boot process if authentication is enabled.
In particular, for a single-node clusters, common for test setups,
auth initialization now takes a couple of milliseconds instead
of the whole 15 seconds.
Fixes #8648
Closes #8649
This is a simple test that serves as an example of using the
framework implemented in the previous commits. We introduce
`ExRegister`, an implementation of `PureStateMachine` that stores
an `int32_t` and handles ``exchange'' and ``read'' inputs;
an exchange replaces the state with the given value and returns
the previous state, a read does not modify the state and returns
the current state. In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.
The commit also introduces a `with_env_and_ticker` helper function which
creates an `environment`, a `ticker`, and passes references to them to
the given function. It destroys them after the function finishes
by calling `abort()`.
`environment` represents a set of `raft_server`s connected by a `network`.
The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.
Needs to be periodically `tick()`ed which ticks the network
and underlying servers.
New servers can be created in the environment by calling `new_server`.
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.
That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.
The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.
Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.
`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
`persistence` represents the data that does not get lost between server
crashes and restarts.
We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.
Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc` coming in a later commit. We ensure that the stored
log either ``touches'' the stored snapshot on the right side or intersects it.
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.
The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.
The actual message passing is implemented with `network` and `delivery_queue`
which are introduced in later commits.
The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receiving function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.
`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.
`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.
The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `raft::persistence`
coming with a later commit.
Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:
1. Before sending a command to a Raft leader through `server::add_entry`,
one must first directly contact the instance of `impure_state_machine`
replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
(represented by a promise-future pair) and a unique ID; it stores the
input side of the channel (the promise) with this ID internally and returns
the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
and sends it as a command to Raft. Thus commands are (ID, machine input)
pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
with the given ID. If it finds one, it sends the output through this
channel.
5. The command sender waits for the output on the obtained future.
The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.
A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
This is a wrapper around `raft::logical_clock` that allows scheduling
events to happen after a certain number of logical clock ticks.
For example, `logical_timer::sleep(20_t)` returns a future that resolves
after 20 calls to `logical_timer::tick()`.
The commit introduces `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could come up with.
Represented by a C++ concept, it consists of: a set of inputs
(represented by the `input_t` type), outputs (`output_t` type), states (`state_t`),
an initial state (`init`) and a transition function (`delta`) which
given a state and an input returns a new state and an output.
The rest of the testing infrastructure is going to be
generic w.r.t. `PureStateMachine`. This will allow easily implementing
tests using both simple and complex state machines by substituting the
proper definition for this concept.
One possibility of modifying this definition would be to have `delta`
return `future<pair<state_t, output_t>>` instead of
`pair<state_t, output_t>`. This would lose some ``purity'' but allow
long computations without reactor stalls in the tests. Such modification,
if we decide to do it, is trivial.
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
This change introduces a query_restrictions object into the virtual
table infrastructure, for now only holding a restriction on partition
ranges.
That partition range is then implemented into
memtable_filling_virtual_table.
This change adds a more specific implementation of the virtual table
called memtable_filling_virtual_table. It produces results by filling
a memtable on each read.
This change introduces the basic interface we expect each virtual
table to implement. More specific implementations will then expand
upon it if needed.
This change adds a new type of mutation reader whose purpose
is to allow inserting operations before an invocation of the proper
reader. It takes a future to wait on and only after it resolves will
it forward the execution to the underlying flat_mutation_reader
implementation.
We remove an error-severity log message for an error that is later
thrown as an exception, caught a few lines below, and then printed
out as a warning.
Fixes #8616
Closes #8617
Refs #251.
Closes #8630
* github.com:scylladb/scylla:
statistics: add global bloom filter memory gauge
statistics: add some sstable management metrics
sstables: make the `_open` field more useful
sstables: stats: noexcept all accessors
"
Row cache reader can produce overlapping range tombstones in the mutation
fragment stream even if there is only a single range tombstone in sstables,
due to #2581. For every range between two rows, the row cache reader queries
for tombstones relevant for that range. The result of the query is trimmed to
the current position of the reader (=position of the previous row) to satisfy
key monotonicity. The end position of range tombstones is left unchanged. So
cache reader will split a single range tombstone around rows. Those range
tombstones are transient, they will be only materialized in the reader's
stream, they are not persisted anywhere.
That is not a problem in itself, but it interacts badly with mutation
compactor due to #8625. The range_tombstone_accumulator which is used to
compact the mutation fragment stream needs to accumulate all tombstones which
are relevant for the current clustering position in the stream. Adding a new
range tombstone is O(N) in the number of currently active tombstones. This
means that producing N rows will be O(N^2).
In a unit test introduced in this series, I saw reading 137'248 rows which
overlap with a range tombstone take 245 seconds. Almost all of CPU time is in
drop_unneeded_tombstones().
The solution is to make the cache reader trim range tombstone end to the
currently emitted sub-range, so that it emits non-overlapping range tombstones.
Fixes #8626.
Tests:
- row_cache_test (release)
- perf_row_cache_reads (release)
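The quadratic behaviour described above can be modelled with a toy accumulator (greatly simplified and purely illustrative: positions are integers, tombstones are (start, end) tuples, and `scanned` stands in for CPU work):

```python
class Accumulator:
    """Toy model of range_tombstone_accumulator: adding a tombstone scans
    the currently active tombstones, so the active-set size drives cost."""
    def __init__(self):
        self.active = []
        self.scanned = 0  # total elements scanned, a proxy for CPU work

    def add(self, start, end, pos):
        # Expire tombstones whose end is not past the current position.
        self.scanned += len(self.active)
        self.active = [t for t in self.active if t[1] > pos]
        self.active.append((start, end))

def emit(n, trim):
    acc = Accumulator()
    for row in range(n):
        # Untrimmed: every sub-tombstone keeps the original far-away end,
        # so none ever expires. Trimmed: end is clamped to the next row.
        end = row + 1 if trim else n + 1
        acc.add(row, end, pos=row)
    return acc

untrimmed = emit(100, trim=False)
trimmed = emit(100, trim=True)
```

With untrimmed ends the active set grows to N and the total work is O(N^2); with ends trimmed to the emitted sub-range, at most one tombstone stays active and the work is O(N).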
"
* tag 'fix-perf-many-rows-covered-by-range-tombstone-v2' of github.com:tgrabiec/scylla:
tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone
row_cache: Avoid generating overlapping range tombstones
range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed
Add the following metrics, as part of #251:
- open for writing (a.k.a. "created", unless I'm missing something?)
- open for reading
- deleted
- currently open for reading/writing (gauges)
Refs #251.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.
In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().
Also, mark as noexcept the methods of class table calling
advance_and_await and table::await_pending_ops that depend on them.
Fixes #8636
A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
Recalculation of the current tombstone is O(N) in the number of active
range tombstones. This can be a significant overhead, so better avoid it.
Solves the problem of quadratic complexity when producing lots of
overlapping range tombstones with a common end bound.
Refs #8625
Refs #8626
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.
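A minimal sketch of the described behaviour (the naming scheme and helper are hypothetical, not the exact Scylla logic):

```python
def default_index_name(table, column, existing_tables):
    """Pick a default index name, skipping names whose backing view name
    (<name>_index) would collide with an existing table."""
    base = f"{table}_{column}_idx"
    candidate, i = base, 0
    while f"{candidate}_index" in existing_tables:
        i += 1
        candidate = f"{base}_{i}"
    return candidate

# With no conflict the base name is used; otherwise a suffix is appended.
free = default_index_name("users", "email", existing_tables=set())
taken = default_index_name("users", "email",
                           existing_tables={"users_email_idx_index"})
```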
This series comes with a test.
Fixes #8620
Tests: unit(release)
Closes #8632
* github.com:scylladb/scylla:
cql-pytest: add regression tests for index creation
cql3: fail to create an index if there is a name conflict
database: check for conflicting table names for indexes
The operator== of enum_option<> (which we use to hold multi-valued
Scylla options) makes it easy to compare to another enum_option
wrapper, but ugly to compare the actual value held. So this patch
adds a nicer way to compare the value held.
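A Python analog of the C++ change (the enum and wrapper are illustrative; the real code overloads operator== on enum_option<>):

```python
import enum

class CompactionStrategy(enum.Enum):
    SIZE_TIERED = 1
    LEVELED = 2

class EnumOption:
    """Wrapper holding an enum value. __eq__ accepts both another wrapper
    and the bare enum value, so the held value can be compared directly."""
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, EnumOption):
            return self.value == other.value
        return self.value == other  # compare against the bare value

opt = EnumOption(CompactionStrategy.LEVELED)
```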
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210511120222.1167686-1-nyh@scylladb.com>
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.
Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.
This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.
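The fixed semantics can be modelled with a toy gate/operation pair (Python has no move semantics, so `move_assign` is an explicit method; the buggy default would skip the `leave()` call and leak an entry):

```python
class Gate:
    def __init__(self):
        self.entered = 0
    def enter(self):
        self.entered += 1
    def leave(self):
        self.entered -= 1

class Operation:
    """Toy model of phased_barrier::operation: holds a gate entered on
    construction and left on destruction or reassignment."""
    def __init__(self, gate=None):
        self.gate = gate
        if gate is not None:
            gate.enter()

    def move_assign(self, other):
        # The fix: leave the currently held gate before taking the other's.
        if self is not other:
            if self.gate is not None:
                self.gate.leave()
            self.gate, other.gate = other.gate, None
        return self

gate = Gate()
op1 = Operation(gate)
op2 = Operation(gate)
op1.move_assign(op2)   # op1's original entry must be left, not leaked
```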
A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.
Fixes #8613
Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
"
The current printout has multiple problems:
* It is segregated by state, each state having its own sorting criteria;
* The number of permits and the count resources are collapsed into a
single column, and it is not clear which of the two is printed;
* The number of available/initial units of the semaphore is not printed.
This series solves all these problems:
* It merges all states into a single table, sorted by memory
consumption, in descending order.
* It separates number of permits and count resources into separate
columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum amount of printable lines, to not blow
up the logs.
The goal of all this is to make it easy to find the culprit of a
semaphore problem: easily spot the big memory consumers, then unpack the
name column to determine which table and code path is responsible.
This brings the printout close to the recently added `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO 2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count memory table/description/state
7 2 77M ks.tbl1/op1/active
6 3 59M ks.tbl1/op0/active
4 0 36M ks.tbl1/op2/active
3 1 36M ks.tbl0/op2/active
11 2 43M permits omitted for brevity
31 8 251M total
"
* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.
However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.
This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.
This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.
Fixes #8600
Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" suffix attached.
To give an example, the index name "a" should be listed before "a1", but
the view names "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.
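The ordering discrepancy is easy to demonstrate (a self-contained illustration of the ASCII point above):

```python
# The view names carry the "_index" suffix, and in ASCII '1' (0x31) sorts
# before '_' (0x5f), so the view-name order differs from the index-name order.
views = sorted(["a1_index", "a_index"])
indexes = sorted(name[: -len("_index")] for name in views)
```

Stripping the suffix after sorting yields a different order than sorting the stripped names, which is exactly the remaining bug.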
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
Ref: #7617
This series adds timeout parameters to service levels.
Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).
Closes #7913
* github.com:scylladb/scylla:
docs: add a paragraph describing service level timeouts
test: add per-service-level timeout tests
test: add refreshing client state
transport: add updating per-service-level params
client_state: allow updating per service level params
qos: allow returning combined service level options
qos: add a way of merging service level options
cql3: add preserving default values for per-sl timeouts
qos: make getting service level public
qos: make finding service level public
treewide: remove service level controller from query state
treewide: propagate service level to client state
sstables: disambiguate boost::find
cql3: add a timeout column to LIST SERVICE LEVEL statement
db: add extracting service level info via CQL
types: add a missing translation for cql_duration
cql3: allow unsetting service level timeouts
cql3: add validating service level timeout values
db: add setting service level params via system_distributed
cql3: add fetching service level attrs in ALTER and CREATE
cql3: add timeout to service level params
qos: add timeout to service level info
db,sys_dist_ks: add timeout to the service level table
migration_manager: allow table updates with timestamp
cql3: allow a null keyword for CQL properties
This patchset adds a basic scylla-gdb.py test to the test suite.
First two patches add the test itself (disabled), subsequent ones are
fixes for scylla-gdb.py to make the test pass, and the last one
enables the test.
Closes #8618
* github.com:scylladb/scylla:
test: enable scylla-gdb/run
scylla-gdb.py: "this" -> "self"
scylla-gdb.py: wrap std::unordered_{set,map} and flat_hash_map
scylla-gdb.py: robustify execution_strategy traversal
scylla-gdb.py: recognize new sstable reader types
scylla-gdb.py: make list_unordered_map more resilient
scylla-gdb.py: robustify netw & gms
scylla-gdb.py: redo find_db() in terms of sharded()
scylla-gdb.py: debug::logalloc_alignment may not exist
scylla-gdb.py: handle changed container type of keyspaces
scylla-gdb.py: walk intrusive containers using provided link fields
test: add a basic test for scylla-gdb.py
test.py: refine test mode control
"
Storage service needs migration notifier reference to pass it to cdc
service via get_local_storage_service(). This set removes
- get_local_storage_service from cdc
- migration notifier from storage service
- db_context::builder from cdc (released nuclear binding energy)
tests: unit(dev)
"
* 'br-cdc-no-storage-service' of https://github.com/xemul/scylla:
storage_service: Remove migration notifier dependency
cdc: Remove db_context::builder
cdc: Provide migration notifier right at once
cdc: Remove db_context::builder::with_migration_notifier
"
This patch-set builds on the existing very basic coverage generation
support and greatly improves it, adding an almost fully automated way of
generating reports, as well as a more manual way.
At the heart of this is a new build mode, coverage, that is dedicated
to coverage report generation, containing all the required build flags,
without interfering with that of the "host" build mode, like currently
(with the --coverage flag).
Additionally a new script, scripts/coverage.py, is added which automates
the magic behind the scenes needed to get from raw profile files to a
nice html report, as long as the raw files are at the expected place.
There are still some rough edges:
* There is no direct ninja support for coverage generation, one has to
build the tests, then run them via test.py.
* Building and running just a few tests is a miserable experience
(#8608).
* Only boost unit tests are supported at the moment when using test.py.
* A --verbose flag for coverage.py would be nice.
* coverage.py could have a way to run a test itself, automatically
adding the required ENV variable(s).
I plan on addressing all these in the future, in the meanwhile, with
this series, the coverage report generation is made available for
non-masochists as well.
"
* 'coverage-improvements/v1' of https://github.com/denesb/scylla:
HACKING.md: update the coverage guide
test.py: add basic coverage generation support
scripts: introduce coverage.py
configure.py: replace --coverage with a coverage build mode
configure.py: make the --help output more readable
configure.py: add build mode descriptions
configure.py: fix fallback mode selection for checkheaders target
configure.py: centralize the declaration of build modes
When decommission is done, all nodes that receive data from the
decommission node will run node_ops_cmd::decommission_done handler.
Trigger off-strategy compaction inside the handler to wire off-strategy
for decommission.
Refs #5226
Closes #8607
This cuts an allocation in the write path. Instruction count reduction isn't
large, but performance does improve (results are consistent):
before: 196369.48 tps ( 55.2 allocs/op, 13.2 tasks/op, 51658 insns/op)
after: 199290.32 tps ( 54.2 allocs/op, 13.2 tasks/op, 51600 insns/op)
(this is perf_simple_query --write --smp 1 --operations-per-shard 1000000)
Since small_vector requires a noexcept move constructor and move
assignment, the corresponding unique_response_handler members are
adjusted/added accordingly.
Closes #8606
* github.com:scylladb/scylla:
storage_proxy: place unique_response_handler:s in small_vector instead of std::vector
storage_proxy: make unique_response_handler friendly to small_vector
storage_proxy: give a name to a vector of unique_response_handlers
On several instance types in AWS and Azure, we get the following
failure during the scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```
We have scylla_prepare:configure_io_slots() running before
scylla-server.service starts, but scylla_io_setup takes place even
earlier. To fix this:
1) Move configure_io_slots() to scylla_util.py, since both
scylla_io_setup and scylla_prepare import functions from it
2) Clean up scylla_prepare, since we don't need the same function twice
3) Use configure_io_slots() during scylla_io_setup to avoid the failure
Fixes: #8587
Closes #8512
The get_all_endpoints() should return the nodes that are part of the ring.
A node inside _endpoint_to_host_id_map does not guarantee that the node
is part of the ring.
To fix, return from _token_to_endpoint_map.
Fixes #8534
Closes #8536
* github.com:scylladb/scylla:
token_metadata: Get rid of get_all_endpoints_count
range_streamer: Handle everywhere_topology
range_streamer: Adjust use_strict_sources_for_ranges
token_metadata: Fix get_all_endpoints to return nodes in the ring
Some unordered_map instantiations have cache=true, some cache=false,
but we don't need to care.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
I haven't found a way to make it stay -- __attribute__((used)) is not
enough and apparently lld is going to ignore __attribute__((retain))
until at least LLVM 13.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
clang & gdb apparently conspire to not reveal template argument types
beyond the first one -- at least for some templates, and definitely
for Boost's intrusive container ones. This severely restricts our
ability to find the right intrusive list link by examining the
container type.
Allow the caller to simply provide the relevant field name, so we
don't have to guess.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
(And disable it initially, because it won't pass without subsequent
commits)
Runs only in release mode, to keep things more realistic.
Doesn't exercise Scylla much at present -- just stops it after several
compactions and tries (almost) all "scylla *" commands in order.
Refs #6952.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
* Add ability to skip tests in individual modes using "skip_in_<mode>".
* Add ability to allow tests in specific modes using "run_in_<mode>".
* Rename "skip_in_debug_mode" to "skip_in_debug_modes", because there
is an actual mode named "debug" and this is confusing.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
This commit adds unit tests for an issue with index creation
after a table with a conflicting name was previously created.
The cases cover both indexes with a default name and ones with
an explicit name set.
When an index with an explicit name is created, its underlying
materialized view's name is set to <index-name>_index.
If there already exists a regular table with such a name,
the creation should fail with a proper error message.
Not really testing anything, at least not automatically. It just
provides coverage for the diagnostics dump code, as well as allowing
developers to inspect the printout visually when making changes.
Too many or too resource-hungry reads often lie at the heart of issues
that require an investigation with gdb. Therefore it is very useful to
have a way to summarize all reads found on a shard with their states and
resource consumptions. This is exactly what this new command does. For
this it uses the reader concurrency semaphores and their permits
respectively, which are now arranged in an intrusive list and therefore
are enumerable.
Example output:
(gdb) scylla read-stats
Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1
permits count memory table/description/state
1 1 14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active
16 0 53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active
1 0 1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive
1 0 0 *.*/view_builder/active
1 0 0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active
20 1 14334414 Total
The command accepts a list of semaphores to dump reads from as its
arguments, or if none are provided, it will dump reads from the
semaphores of the local database instance.
With a helper client state refresher, some attributes
which are usually only refreshed after a client disconnects
and then reconnects, can be verified in the test suite.
Per-service-level parameters (currently timeouts)
are now updated when a new connection is established.
The other connections which have the changed role are currently
not immediately reloaded.
Originally, the API for finding a service level controller returned
its name, which also implied that only a single service level
may be active for a user and provide its options.
After adding timeout parameters it makes more sense to return a result
which combines multiple service level parameters - e.g. a user
can be attached to one level for read timeouts and a separate one
for write timeouts.
In order to combine multiple service level options coming from
multiple roles, a helper function is provided to merge two
of them. The semantics depend on each parameter, but for timeouts,
which are the only parameters at the time of writing this message,
the minimum value of the two is taken. That in particular means
that when service level A has timeout = 50ms and service level B
has timeout = 1s, the resulting service level options
would set the timeout to 50ms.
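A sketch of the described merge semantics (function names and the dict representation are hypothetical; the real implementation is C++, but the "stricter timeout wins" rule is the one described above):

```python
def merge_timeouts(a, b):
    """Merge one timeout parameter (in ms) from two service levels:
    the stricter (smaller) value wins; None means 'not set'."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)

def merge_service_level_options(a, b):
    # Merge each timeout parameter independently.
    return {k: merge_timeouts(a.get(k), b.get(k))
            for k in set(a) | set(b)}

# Service level A: read timeout 50ms; service level B: read 1s, write 200ms.
merged = merge_service_level_options({"read_timeout": 50},
                                     {"read_timeout": 1000, "write_timeout": 200})
```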
In order to avoid needless schema disagreements, a way of announcing
a schema change with a fixed timestamp is added.
That way, when nodes update schemas of their internal tables (e.g.
during updates), it's possible for all nodes to use an identical
timestamp for this operation, which in turn makes their digests
identical.
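The effect of the fixed timestamp can be illustrated with a toy digest (real Scylla digests are computed over schema mutations; the triples and timestamps here are made up):

```python
import hashlib

def schema_digest(tables):
    """Toy digest over (name, definition, timestamp) triples."""
    h = hashlib.sha256()
    for name, definition, ts in sorted(tables):
        h.update(f"{name}:{definition}:{ts}".encode())
    return h.hexdigest()

FIXED_TS = 1000  # an agreed-upon fixed timestamp for the internal-table update

# Two nodes applying the same internal-table update with local timestamps
# would diverge; with the fixed timestamp their digests agree.
node_a = schema_digest([("sl", "v2", FIXED_TS)])
node_b = schema_digest([("sl", "v2", FIXED_TS)])
node_c = schema_digest([("sl", "v2", 1234)])  # local timestamp: diverges
```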
This report is logged, so we don't want huge printouts: cap the table at
20 lines, and print only a summary for the rest.
For manual dumps, allow the limit to be set to a custom value, including
no limit at all.
The goal of the printout is to allow finding the culprit for semaphore
related problems and this usually involves finding the table/op/state
eating the most memory. This is much easier when all the permit
summaries are in a single table.
Currently we have a single "count" column and it is not at all clear
what it refers to: the number of permits or count resources used by
them. Whichever it is, it only represents one of them, so in this commit
we add a "permits" column, which in addition to clearing things up,
supplies further information to the printout.
We'd like to change the vector to a more involved type, so to avoid
repeating it everywhere, give it a name. The actual type isn't changed
in this patch.
The tracker maintains a separate pointer to the current leader's
progress, but all this complexity is not needed because the tracker
already has a find() function that can either find a leader's progress
by id or return null. Removing the tracking simplifies the code and
makes going out of sync (which is always a possibility when a state is
maintained in two different places) impossible.
current_leader has meaning only when the fsm is in the follower state.
In the leader state a node is always its own follower, and in the
candidate state there is no leader. To make sure that the current_leader
value cannot be out of sync with the fsm state, move it into the
follower state.
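The "move the field into the state variant" idea can be sketched as a tagged union (a conceptual model; the real fsm is C++ and the names here are illustrative):

```python
from dataclasses import dataclass
from typing import Optional, Union

# Only the follower variant carries current_leader, so an out-of-sync
# value is unrepresentable in the other states.
@dataclass
class Follower:
    current_leader: Optional[str]  # may be unknown right after an election

@dataclass
class Candidate:
    pass  # no leader exists while campaigning

@dataclass
class Leader:
    pass  # a leader is its own "leader"; no field needed

State = Union[Follower, Candidate, Leader]

def known_leader(state: State, my_id: str) -> Optional[str]:
    if isinstance(state, Leader):
        return my_id
    if isinstance(state, Follower):
        return state.current_leader
    return None  # Candidate: no leader

state: State = Follower(current_leader="node-1")
```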
With strict mode, it could happen that an sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.
This is fixed by respecting the offstrategy threshold. Unit test is
added.
Fixes #8573.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
Instead of some slides from an internal summit: debugging.md has much
more detail than said slides (which lack the associated voice
recording).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210507125956.381763-1-bdenes@scylladb.com>
Add support for the newly added coverage mode. When --mode=coverage,
also invoke the coverage generation report script to produce a coverage
report after having run the tests.
There are still some rough edges, alternator and cql tests don't work.
This script finds all the profiling data generated by tests at the
specified path, merges it, and generates an HTML report combining all
the results.
It can be used both as a standalone python script, or imported and
invoked from python directly.
A separate build mode is a much better fit for coverage generation.
Generating coverage requires certain flags and optimization modes,
which is much better expressed with a separate build mode than by
bolting it on top of an existing one, possibly conflicting with its own
requirements.
This patch therefore converts the current `--coverage` flag to a build
mode of its own. The build mode is based on debug mode, in fact seastar
is built in plain debug mode, with some extra cflags.
The new build mode is called "coverage" and it is a non-default mode (by
default configure.py doesn't generate build files for it).
The huge amount of choices for the --with argument obscures the help
output, making it hard to read. This patch removes the choices list and
instead manually checks the passed in artifacts. Unknown artifacts are
removed from the list and if it remains empty the script is aborted.
Available artifacts can be listed by a new --list-artifacts flag.
Currently modes[0] is used as the fallback when 'dev' is not available.
But modes is a dict with mode names as keys, so this won't work. Replace
it with modes.keys()[0] to select the first key instead.
Currently the declaration of build modes is scattered throughout the
script. We have several places where build modes are mentioned
hardcoded, and related configuration is also scattered in several data
structures. This commit centralized all this into a single data
structure, all other code uses this to iterate over modes and to mutate
their configuration.
This patch was motivated by the wish to make it easier to add a new
build mode, which is what the next patch does. This is not something we
do often, but I believe these changes also serve to make the code easier
to understand and modify.
This should really be provided by the C++ standard library and indeed I
do recall pretty-printing for std:: types working sometimes. They don't
most of the time however, which is not a disaster if you just
want to use them in the gdb shell, but is not fine if we want to rely
on them in internal scripts, which is what the next patch does. So
provide it.
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.
This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.
Test results (perf_simple_query --smp 1 --task-quota-ms 10):
```
before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op)
after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op)
```
4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).
The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.
I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.
Closes #8592
* github.com:scylladb/scylla:
storage_proxy, treewide: use utils::small_vector inet_address_vector:s
storage_proxy, treewide: introduce names for vectors of inet_address
utils: small_vector: add print operator for std::ostream
hints: messages.hh: add missing #include
* scylla-dev/raft-snapshot-fixes-v4:
raft: document that add entry may throw commit_status_unknown
raft: test: add test of a leadership change during ongoing snapshot transfer
raft: test: retry submitting an entry if it was dropped
raft: test: wait for the log to be fully replicated on new leader only
raft: drop waiters with outdated terms
raft: make snapshot transfer abortable
raft: accept snapshots transfer from multiple nodes simultaneously
raft: do not send probes while transferring snapshot
raft: handle messages sending errors
raft: test: return error from rpc module if nodes are disconnected
raft: fix a typo in a variable name
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch set, we continue the work to switch decommission and bootstrap
operation to use NODE_OPS_CMD.
Fixes #8472
Fixes #8471
Closes #8481
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for bootstrap operation
repair: Switch to use NODE_OPS_CMD for decommission operation
"
The current home of these tests -- mutation_reader_test -- is already
one of the largest test files we have. To reduce the size of the former
and to make finding these tests easier, they are moved to a separate
file.
"
* 'reader-concurrency-semaphore-test/v2' of https://github.com/denesb/scylla:
test: move reader_concurrency_semaphore related tests into separate file
test: mutation_reader_test: convert restricted reader tests to semaphore tests
Currently an entry is declared to be dropped only when an entry with a
different term is committed at the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.
Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
commit a dummy entry which will be committed eventually and will release
one of the waiters on A, but if anything else is submitted to B 9 other
waiters will wait forever.
The way to solve that is to drop all waiters that wait for a term
smaller than the one being committed. There is no chance they will ever
be committed, since terms in the log may only grow.
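A toy model of the rule (greatly simplified: waiters are (index, term, future-like dict) tuples; the real code uses Raft's internal structures):

```python
def release_waiters(waiters, committed_index, committed_term):
    """On commit, resolve waiters for committed entries and *drop* any
    waiter whose term is smaller than the committed term -- its entry
    can never be committed, since log terms only grow."""
    remaining = []
    for index, term, fut in waiters:
        if index <= committed_index and term == committed_term:
            fut["result"] = "committed"
        elif term < committed_term:
            fut["result"] = "dropped"   # the fix: fail it immediately
        else:
            remaining.append((index, term, fut))
    return remaining

# Entries 1..3 were appended in term 1 on the old leader; the new leader
# commits its dummy entry at index 4 in term 2.
futs = [{} for _ in range(3)]
waiters = [(i + 1, 1, futs[i]) for i in range(3)]
waiters = release_waiters(waiters, committed_index=4, committed_term=2)
```

All three old-term waiters are released with an error instead of hanging forever.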
A snapshot transfer may take a lot of time and meanwhile a leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfers abortable and
abort them when the leader changes.
A leader may change while one of its followers is in snapshot transfer
mode and that node may get an additional request for snapshot transfer
from a new leader while the previous transfer is still not aborted.
Currently such a situation triggers an assert. This patch allows active
snapshot transfers from multiple nodes, but only one of them will
succeed in the end; all others will be replied to with 'fail'.
The mutation_reader_test is already one of our largest test files.
Move the reader concurrency semaphore related tests to a new file,
making them easier to find and making the mutation reader test a little
bit smaller too.
These two tests (restricted_reader_timeout and
restricted_reader_max_queue_length) are testing the semaphore in
reality, but through the restricted reader, which is distracting as it
needlessly brings in an additional layer into the picture. Rewrite them
to test the semaphore directly, getting much lighter in the process.
Apparently, if a container is passed to `list()`, it tries to obtain its
size first, which in this case leads to infinite recursion. To prevent
this, coerce `self` to an iterator.
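A minimal sketch of the pattern (names are illustrative, not the actual scylla-gdb.py code): `list()` probes its argument's `__len__` to pre-size the result, so defining `__len__` in terms of `list(self)` recurses forever; handing `list()` a bare iterator avoids the size probe.

```python
class NodeList:
    """A lazy container that defines __len__ by materializing itself."""
    def __init__(self, items):
        self._items = items

    def __iter__(self):
        yield from self._items

    def __len__(self):
        # Buggy: list(self) asks the container for a size hint first,
        # which calls __len__ again -> infinite recursion:
        #   return len(list(self))
        # Fixed: a bare iterator carries no size hint, so list() just
        # consumes it:
        return len(list(iter(self)))
```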
The everywhere_topology reports the number of nodes in the cluster as
the RF. This makes streaming only from the node losing the range
impossible, since no node loses the range after bootstrap.
Shortcut this by not using a strict source when the keyspace uses
everywhere_topology.
Refs #8503
Now get_all_endpoints() returns the number of nodes in the ring. We
need to adjust the check for whether to use a strict source.
Use a strict source when the number of nodes in the ring is equal to or
greater than the RF.
Refs #8534
The get_all_endpoints() should return the nodes that are part of the ring.
A node inside _endpoint_to_host_id_map does not guarantee that the node
is part of the ring.
To fix, return from _token_to_endpoint_map.
Fixes #8534
Replace std::vector<inet_address> with a small_vector of size 3 for
replica sets (reflecting the common case of local reads, and the somewhat
less common case of single-datacenter writes). Vectors used to
describe topology changes are of size 1, reflecting that up to one
node is usually involved with topology changes. At those counts and
below we save an allocation; above those counts everything still works,
but small_vector allocates like std::vector.
In a few places we need to convert between std::vector and the new types,
but these are all out of the hot paths (or are in a hot path, but behind a
cache).
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.
This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.
Fixes: #8591
Tests: unit(dev)
"
* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
service/storage_proxy: get_max_result_size() defer to db for unlimited queries
database: add get_unlimited_query_max_result_size()
query_class_config: add operator== for max_result_size
database: get_reader_concurrency_semaphore(): extract query classification logic
The timestamp_type is an int64_t, so it has to be explicitly
initialized before use.
The missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.
Fixes #8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes #8590
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:
1. Consider only current bucket for each sstable.
No need to scan all buckets using a map
since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
As we insert increasingly larger sstables into a bucket,
its average size drifts up and eventually this may break
the bucket invariant that all sstables in the bucket should
be within the (bucket_low, bucket_high) range relative
to the bucket average.
Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy
Fixes #8584
"
* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since Linux 5.12 [1], XFS is able to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.
That turns out to be 1024 for crc-protected filesystems (which we want)
and it can also not be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.
Tested on AWS i3.large.
Fixes #8156.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b
Closes #8585
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
As far as I can tell, we never documented requirement for self-contained
headers in our coding style. So let's do it now, and explain the
"ninja dev-headers" command and how to use it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
In segregate mode scrub will segregate the content of input sstables
into potentially multiple output sstables such that they respect
partition level and fragment level monotonicity requirements. This can
be used to fix data where partitions or even fragments are out-of-order
or duplicated. In this case no data is lost and after the scrub each
sstable contains valid data.
Out-of-order partitions are fixed by simply being written into a
separate output, compared to the last one compaction was writing into.
Out-of-order fragments are fixed by injecting a
partition-end/partition-start pair right before them, effectively
moving them into a separate (duplicate) partition which is then treated
in the above mentioned way.
This mode can fix corruptions where partitions are out-of-order or
duplicated.
This mode cannot fix corruptions where partitions were merged: although
the data will be made valid at the database level, it won't be at the
business-logic level.
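The segregation idea can be shown on plain partition keys (a toy first-fit sketch; the real compaction code works on mutation fragments, not keys):

```python
def segregate(partition_keys):
    """Split a possibly out-of-order/duplicated key sequence into output
    streams that are each strictly increasing. Each key goes to the
    first existing output it can extend; otherwise a new output opens."""
    outputs = []
    for key in partition_keys:
        for out in outputs:
            if out[-1] < key:       # strict: duplicates open a new output
                out.append(key)
                break
        else:
            outputs.append([key])
    return outputs
```

Every output respects partition-level monotonicity, so each can be written out as a valid sstable.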
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
auto total_size = bucket.size() * old_average_size;
auto new_average_size = (total_size + size) / (bucket.size() + 1);
```
We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.
Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.
Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
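The drift can be reproduced in a few lines (illustrative values, not the actual compaction code):

```python
def int_running_avg(sizes):
    avg = sizes[0]
    for n, size in enumerate(sizes[1:], start=1):
        total = n * avg                   # too small: avg already rounded down
        avg = (total + size) // (n + 1)   # rounded down again
    return avg

def float_running_avg(sizes):
    avg = float(sizes[0])
    for n, size in enumerate(sizes[1:], start=1):
        avg = (n * avg + size) / (n + 1)  # no rounding at any step
    return avg

sizes = [10, 11, 11, 11]
# int_running_avg(sizes) stays stuck at 10; float_running_avg(sizes) is 10.75
```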
Since now it became a reference used to update the bucket's average size
after a new sstable is inserted into it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.
Instead, just consider the most recently inserted bucket.
Once we see a sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.
Note, `old_average_size` should be renamed since this change
turns it into a reference and it's assigned with the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
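The single-bucket scan described above can be sketched as follows (a simplified Python illustration; the real implementation, with its full option handling, is in size_tiered_compaction_strategy):

```python
def get_buckets(sorted_sizes, bucket_low=0.5, bucket_high=1.5):
    """sstables arrive sorted by size, so only the most recently opened
    bucket needs to be considered for each one."""
    buckets = []
    avg = 0.0  # running average of the current bucket, kept as a float
    for size in sorted_sizes:
        if buckets and bucket_low * avg <= size <= bucket_high * avg:
            bucket = buckets[-1]
            bucket.append(size)
            avg = (avg * (len(bucket) - 1) + size) / len(bucket)
        else:
            buckets.append([size])   # size out of range: open a new bucket
            avg = float(size)
    return buckets
```

Because the size check runs before insertion, an sstable that would push the average too far up simply opens a new bucket, preserving the (bucket_low, bucket_high) invariant.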
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
Allow resetting the validator to a given partition or mutation fragment.
This allows a user which is able to fix corrupt streams to reset the
validator to a partition or row which the validator normally wouldn't
accept and hence it wouldn't advance its internal state to it.
Add a new segregator which segregates a stream, potentially containing
duplicate or even out-of-order partitions, into multiple output streams,
such that each output stream has strictly monotonic partitions.
This segregator will be used by a new scrub compaction mode which is
meant to fix sstables containing duplicate or out-of-order data.
Add direct support to the newly added scrub mode enum. Instead of the
legacy `skip_corrupted` flag, one can now select the desired mode from
the mode enum. `skip_corrupted` is still supported for backwards
compatibility but it is ignored when the mode enum is set.
Scrub compaction will add the missing last partition-end in a stream
when allowed to modify the stream. This however can cause an infinite
loop:
1) user calls fill_buffer()
2) process fragments until underlying is at EOS
3) add missing partition end
4) set EOS
5) user sees that last buffer wasn't empty
6) calls fill_buffer() again
7) goto (3)
To prevent this cycle, break out of `fill_buffer()` early when both the
scrub reader and the underlying is at EOS.
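A hedged model of the fix (a Python stand-in for the C++ reader; the fragment strings are illustrative):

```python
class ScrubReader:
    """Once the wrapper has appended the synthetic partition-end and set
    EOS, a further fill_buffer() must exit early instead of appending
    another partition-end."""
    def __init__(self, fragments):
        self._frags = list(fragments)
        self._buffer = []
        self._eos = False

    def fill_buffer(self):
        if self._eos:                 # both wrapper and underlying at EOS
            return
        while self._frags:            # drain the underlying reader
            self._buffer.append(self._frags.pop(0))
        if self._buffer and self._buffer[-1] != "partition-end":
            self._buffer.append("partition-end")  # add it exactly once
        self._eos = True
```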
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
Instructions retired per op is a much more stable metric than time per
op (inverse throughput) since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected, since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.
This allows incremental changes to the code base to be compared with
more confidence.
Current results are around 55k instructions per read, and 52k for writes.
Closes #8563
* github.com:scylladb/scylla:
test: perf: tidy up executor_stats snapshot computation
test: perf: report instructions retired per operations
test: perf: add RAII wrapper around Linux perf_event_open()
test: perf: make executor_stats_snapshot() a member function of executor
As we are now serially adding commands with consecutive integers there
is no need to build vectors of commands. Remove helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:
```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```
For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.
Therefore, to properly decode pre-images you would need to know in which mode pre-image was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check a current mode of pre-image).
A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:
If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```
A client library can now properly decode a pre-image row. If it sees a `NULL` value, it can check the `cdc$deleted_` column to determine whether this `NULL` value was a part of the pre-image or whether it was omitted due to not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.
Additional example of trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:
```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```
```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.
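With the `cdc$deleted_` columns filled in, a client can decode a pre-image value along these lines (a sketch, not any particular driver's API):

```python
def decode_preimage_value(row, col):
    """row maps column names (including 'cdc$deleted_<col>') to values.
    Returns a (kind, value) pair describing what NULL means for col."""
    value = row.get(col)
    if value is not None:
        return ("present", value)
    if row.get(f"cdc$deleted_{col}"):
        return ("null-in-preimage", None)  # column was NULL before the write
    return ("not-affected", None)          # column omitted ('true' mode)
```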
Closes #8568
* github.com:scylladb/scylla:
cdc: log: assert post_image is always in full mode
cdc: tests: check cdc$deleted_ columns in images
cdc: log: fill cdc$deleted_ columns in pre-images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.
This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
The tests, when added, were not named in the pattern (*_test) which the
runner, apparently quaintly, requires in order to pick them up (instead
of the more sensible *.cql).
Thus, the test was never run beyond its initial creation, and also
bit-rotted slightly during behaviour changes.
Renamed and re-resulted.
Closes #8581
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.
The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximum IO
lengths).
The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.
tests: unit(dev)
"
* 'br-streams-napi' of https://github.com/xemul/scylla:
sstables: Make checksum sink report buffer size from lower sink
sstables: Report buffer size from compressed file sink
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, if sync_all_segments fails during shutdown,
_shutdown is never set, causing replenish_reserve
to hang, as possibly seen in #8577.
It is better if scylla aborts on critical errors
during shutdown rather than just hanging.
Refs #8577
Test: unit(dev)
DTest: commitlog_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring current data back to full consistency.
When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.
This PR introduces the possibility of waiting for hints to be replayed before continuing with a user-issued repair. The coordinator of the repair operation asks all nodes participating in it (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in the repair. Then, it waits until hint replay in all those queues reaches the marked point, or the configured timeout expires.
This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
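For example, in scylla.yaml (the timeout value here is illustrative, not a recommendation):

```yaml
# Opt in: wait up to 60s for hint replay before starting a user repair.
wait_for_hint_replay_before_repair_in_ms: 60000
```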
Fixes #8102
Tests:
- unit(dev)
- some manual tests:
- shutting down repair coordinator during hints replay,
- shutting down node participating in repair during hints replay,
Closes #8452
* github.com:scylladb/scylla:
repair: introduce abort_source for repair abort
repair: introduce abort_source for shutdown
storage_proxy: add abort_source to wait_for_hints_to_be_replayed
storage_proxy: stop waiting for hints replay when node goes down
hints: dismiss segment waiters when hint queue can't send
repair: plug in waiting for hints to be sent before repair
repair: add get_hosts_participating_in_repair
storage_proxy: coordinate waiting for hints to be sent
config: add wait_for_hint_replay_before_repair option
storage_proxy: implement verbs for hint sync points
messaging_service: add verbs for hint sync points
storage_proxy: add functions for syncing with hints queue
db/hints: make it possible to wait until current hints are sent
db/hints: add a metric for counting processed files
db/hints: allow to forcefully update segment list on flush
Add support for configuration change on leader.
Keep track of servers in config in test.
Add a dummy entry to confirm configuration changed. If the add fails,
because the old leader was not in the new config and stepped down, the
config is considered changed, too.
Add a test with some configuration changes.
Add a test cycling every scenario for 1 of 4 nodes removed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use a special value as dummy entry to be ignored when seen in state
machine input.
Ignore dummy entries for count.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change the default was prevote enabled.
With this change each test is run with and without prevote.
This doubles the number of test cases.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The test suite requires an initial leader and at the moment it's always
just 0. Make it the default and simplify the code.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a leader was already disconnected, the election of a new leader could
re-connect it. Save the original connectivity and restore it when done
electing the new leader.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use the new specific connectivity to manage old leader disconnection
more specifically.
This fixes having elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader. For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for B.
So to make the general case old leader needs to be first disconnected
from all nodes, make the desired node candidate, then have the old
leader connected only to the desired candidate (else, other nodes would
see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, the patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the latter waits until a candidate either
becomes a leader or reverts to follower.
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add 2 helper functions: one for making nodes reach the timeout
threshold and one for electing a specific node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Replace simple full disconnect of a node with specific from -> to
disconnection tracking.
This will help electing new leaders.
Say there are {A,B,C} with A leader and we want to elect B.
Before this patch, we would disconnect A, run an election with just
{B,C}, and then re-connect A.
If we have {A,B} and want to elect B, this won't work as B needs 2/2+1
votes and A is disconnected, even if we made A step down. This patch
corrects this shortcoming. (@gleb-cloudius)
With this patch, we can specify other followers (not the previous or
next leader) to not see the old leader, but the new and old leaders see
each other just fine. In the example {A,B,C} above we can cut A<->B
specifically.
Also, this is closer to etcd testing and should help porting cases.
NOTE: in the current test implementation failure_detector reports
node.is_alive(other_node) if there is a connection both ways.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Checksum was removed so undo support for multiple versions added in:
test: add support for different state machines
43dc5e7dc2
NOTE: as there is a test with custom total_values, expected value cannot
be static const anymore. (line 630)
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Previously, entries were added in parallel and we needed to check if
order was broken. Using a simple checksum was better than a hash as you
could easily find the position it broke (we add consecutive numbers).
Now order of entries is forced so it's not useful. This patch removes
it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
* seastar 0b2c25d133...f1b6b95b69 (10):
> Merge "Cap streams buffer sizes with IO limits" from Pavel
> io_queue: fix mismatched class/struct tag for priority_class_data
> perftune.py: instrument bonding tuning flow with 'nic' parameter
> perftune.py: strip a newline off the 'scheduler' file content
> perftune.py: add support for virtual non-raid disk devices
> doc: fix typos in doc/tutorial.md
> Merge "Add IO in-disk stats" from Pavel E
> iotune: Perform fs-check on all directories
> file: Keep reference on io-queue
> Merge "Assorted set of improvements over io-queue" from Pavel E
An out-of-block log print resulted in repeated prints about the removal
of the default service level, once every time the configuration is
scanned for changes. It happens when the default service level is one of
the last in the map (sorted as in the map).
Fixes #8567
Closes #8576
Introduce a tagged id struct for `group_id`.
Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).
Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.
The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
Many workloads have fairly constant and small request sizes, so we
don't need large reserves for them. These workloads suffer needlessly
from the current large reserve of 10 segments (1.2MB) when they really
need a few hundred bytes. Reduce the reserve to a minimum of 1 segment.
Note that due to #8542 this can make a large difference. Consider a
workload that has a 1000-byte footprint in cache. If we've just
consumed some free memory and reduced the reserve to zero, then
we'll evict about 50,000 objects before proceeding to compact. With
the reserved reduced to 1, we'll evict 128 objects. All this
for 1000 bytes of memory.
Of course, #8542 should be fixed, but reducing the reserve provides
some quick relief and makes sense even with the larger fix. The
reserve will quickly grow for workloads that handle bigger requests,
so they won't see an impact from the reduction.
Closes #8572
The only reason why storage service keeps a reference to the migration
notifier is that the latter was needed by cdc before the previous patch.
Now cdc gets the notifier directly from main, so storage service is
a bit more off the hook.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the builder is just an opaque transfer between cdc_service
constructor args and cdc_service's db_context constructor args.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only way db_context's migration notifier reference is set up
is via cdc_service->db_context::builder->.build chain of calls.
Since the builder's notifier optional reference is always
disengaged (the .with_migration_notifier is removed by previous
patch) the only possible notifier reference there is from the
storage service which, in turn, is the same as in main.cc.
That said -- push the notifier reference onto db_context directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before this change, cdc$deleted_ columns were all NULL in pre-images.
Lack of such information made it hard to correctly interpret the
pre-image rows, for example:
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
For this example, pre-image generated for the second operation
would look like this (in both 'true' and 'full' pre-image mode):
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
v=NULL has two meanings:
1. If pre-image was in 'true' mode, v=NULL describes that v was not
affected (affected columns: pk, ck, v2).
2. If pre-image was in 'full' mode, v=NULL describes that v was equal
to NULL in the pre-image.
Therefore, to properly decode pre-images you would need to know in
which mode pre-image was configured on the CDC-enabled table at the
moment this CDC log row was inserted. There is no way to determine
such information (you can only check a current mode of pre-image).
A solution to this problem is to fill in the cdc$deleted_ columns
for pre-images. After this change, for the INSERT described above, CDC
now generates the following log row:
If in pre-image 'true' mode:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
If in pre-image 'full' mode:
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
A client library now can properly decode a pre-image row. If it sees
a NULL value, it can now check the cdc$deleted_ column to determine
if this NULL value was a part of pre-image or it was omitted due to
not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images
are always generated in the 'full' mode.
Additional example of trouble decoding pre-images before this change.
tbl2 - 'true' pre-image mode, tbl3 - 'full' pre-image mode:
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
generated pre-image:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
generated pre-image:
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
Both pre-images look the same, but:
1. v=NULL in tbl2 describes v being omitted from the pre-image.
2. v=NULL in tbl3 describes v being NULL in the pre-image.
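The decoding rule this change enables can be sketched in Python. The helper name and the plain-dict row representation are illustrative; only the cdc$deleted_ column convention comes from the commit:

```python
def decode_preimage_column(row, col):
    """Interpret column `col` of a pre-image row (a plain dict here).

    Returns one of:
      ("value", v)      - the column had value v in the pre-image
      ("null", None)    - the column was NULL in the pre-image
                          (cdc$deleted_<col> is true)
      ("omitted", None) - the column was not part of the pre-image,
                          i.e. not affected by the delta operation
    """
    value = row.get(col)
    if value is not None:
        return ("value", value)
    if row.get("cdc$deleted_" + col):
        return ("null", None)
    return ("omitted", None)
```

With this rule the two identical-looking rows above decode differently: in 'full' mode the row carries cdc$deleted_v=true and decodes as NULL, while in 'true' mode it decodes as omitted.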
`scylla apply` iterates over all shards in an undetermined order and leaves
the last shard as the current one. This is counter-intuitive and can
lead to surprises as the user might not expect the current shard to be
changed by a command that just executes a command on each shard.
This patch ensures that both in case of the happy and error paths the
current shard is unchanged.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429104937.61315-1-bdenes@scylladb.com>
The error message currently complains about "invalid version" and later
says the reason is that the path is not recognized. This is confusing so
change the error message to start with "invalid path" instead. It is the
path that is invalid not the version after all.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429092749.52659-1-bdenes@scylladb.com>
So it is closed when loading the sstable throws an exception too.
Failing to close the manager will mask the real error as the user will
only see the assert failure due to failing to close the manager.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429092248.50968-1-bdenes@scylladb.com>
We encountered a phenomenon where shutting down the messaging service
doesn't complete, leaving the shutdown process stuck. Since we couldn't
pinpoint where exactly the shutdown went wrong, here we add some
verbosity to the shutdown stages so we can more accurately pinpoint the
culprit.
Closes #8560
The fuzzy test consumes a large chunk of resources from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 unit in reality.
Furthermore, this can even cause a deadlock since 8aaa3a7, as we now rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test; we now have a dedicated
unit test for checking a heavily contested semaphore that does it
properly, so there is no need to try to fix this clumsy attempt that is
just making trouble at this point.
Refs: #8493
Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
Previously, we checked if all partition-key restrictions were EQ at runtime. This is, however, known already at prep time; no need to redo it on every query execution. Move the check to prep time.
Tests: unit (dev, debug), perf_simple_query
Closes#8565
* github.com:scylladb/scylla:
cql3: Replace runtime check with a prepared flag
cql3: Track IN partition-key restrictions
cql3: Inline add_single_column_restriction
cql3: Inline statement_restrictions::add_restriction
Checking that every PK restriction is an EQ was happening at runtime.
This is wasteful, as the result is always the same. Replace that
check with a flag computed once at preparation time.
Separate the simple-case processing into its own function rather than
pass the flag as an extra parameter.
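The idea of hoisting the check to preparation time can be sketched like this (all names are illustrative, not Scylla's actual classes):

```python
class PreparedSelect:
    """Sketch: compute the 'all partition-key restrictions are EQ' flag
    once, when the statement is prepared, instead of on every execution."""

    def __init__(self, pk_restrictions):
        # pk_restrictions: list of (column, operator) pairs
        self._all_pk_eq = all(op == "EQ" for _, op in pk_restrictions)

    def execute(self):
        # The flag selects the processing path without re-scanning
        # the restrictions on the hot path.
        if self._all_pk_eq:
            return self._execute_single_partition()
        return self._execute_general()

    def _execute_single_partition(self):
        return "single-partition path"

    def _execute_general(self):
        return "general path"
```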
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add a bool member to statement_restrictions indicating whether any of
the partition columns are restricted by IN, which requires more
complex processing.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This fixes a potential cause for reactor stalls during memory
reclamation. Applies only to schemas without clustering columns.
Every partition in cache has a dummy row at the end of the clustering
range (last dummy). That dummy must be evicted last, because MVCC
logic needs it to be there at all times. If LRU picks it for eviction
and it's not the last row, eviction does nothing and moves
on. Eventually, all other rows in this partition will be evicted too
and then the partition will go away with it.
Mutation reader updates the position of rows in the LRU (aka touching)
as it walks over them. However, it was not always touching the last
dummy row. If the partition was fully cached, and schema had no
clustering key, it would exit early before reaching the last dummy
row, here:
```
inline
void cache_flat_mutation_reader::move_to_next_entry() {
    clogger.trace("csm {}: move_to_next_entry(), curr={}", fmt::ptr(this), _next_row.position());
    if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
        move_to_next_range();
```
That's because no_clustering_row_between() is always true for any key
in tables with no clustering columns, and the reader advances to
end-of-stream without advancing _next_row to the last dummy. This is
expected and desired, it means that the query range ends at the
current row and there is no need to move further. We would not take
this exit for tables with a non-singular clustering key domain and
open-ended query range, since there would be a key gap before the last
dummy row.
Refs #2972.
The effect of leaving the last dummy row untouched is that such
scans segregate rows in the LRU, bringing all regular rows to the
front and leaving dummy rows at the tail. When eviction reaches the band of
dummy rows, it has to walk over it, because evicting them
releases no memory. This can cause a reactor stall.
An easy fix for the scenario would be to always touch the dummy entry
when entering the partition. It's unlikely that the read will not
proceed to the regular rows. It would be best to avoid linking such
dummies in the LRU, but that's a much more complex change.
Discovered in perf_row_cache_update, test_small_partitions(). I saw
200ms stalls with -m8G.
Refs #8541.
Tests:
- row_cache_test (release)
- perf_simple_query [no change]
Message-Id: <20210427111619.296609-1-tgrabiec@scylladb.com>
Invoking statement_restrictions::add_single_column_restriction()
outside the constructor would leave some data members out-of-date.
Prevent it by deleting the method and inlining its body into the only
call site.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Now that executor_stats_snapshot() is a member function, we can move
the capture of _count into invocations into it, capturing all the
stats in one place.
Invoking this method outside the constructor would leave some data
members out-of-date. Prevent it by deleting the method and inlining
its body into the only call site.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Instructions retired per op is a much more stable metric than time per op
(inverse throughput), since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected, since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.
This allows incremental changes to the code base to be compared with
more confidence.
I'd like to add an instructions counter which isn't accessible via
a global, so make the snapshot function a member. Out of respect to #1,
define functions for getting the number of allocations and tasks processed,
as they need heavy header files.
Currently `scylla shard` with no args results in
the following error:
```
(gdb) scylla shard
Traceback (most recent call last):
File "master-scylla-gdb.py", line 2384, in invoke
id = int(arg)
ValueError: invalid literal for int() with base 10: ''
Error occurred in Python: invalid literal for int() with base 10: ''
```
Instead, let it just print the current shard, similar to
`(gdb) thread`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210428093913.1070051-1-bhalevy@scylladb.com>
I used {:.0} to truncate to an integer, but apparently that resulted
in only one significant digit in the report, so 93.1 was reported as
90. Use {:5.1f} instead to avoid truncation, and even get an extra
digit (we can have fractional tasks/op due to batching).
Current result is 93.1 allocs/op, 20.1 tasks/op (which suggests
batch size of around 10).
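Python's format mini-language behaves like fmt's for this specifier, so the fix is easy to check in isolation. Width 5, one digit after the decimal point:

```python
# {:5.1f}: minimum field width 5, exactly one digit after the decimal
# point, right-aligned. One decimal digit is kept, so 93.1 is not
# collapsed to a single significant digit.
report = "{:5.1f} allocs/op, {:5.1f} tasks/op".format(93.1, 20.1)
```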
Closes #8550
"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reduce both latency and operation time.
Refs #5226.
"
* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
repair: Wire offstrategy compaction to repair-based removenode
table: introduce trigger_offstrategy_compaction()
repair/row_level: make operations_supported static const
In the alternator and cql-pytest test frameworks, we have some convenient
contextmanager-based functions that allow us to create a temporary
resource (e.g., a table) that will be automatically deleted, for
example:
with create_stream_test_table(...) as table:
test_something(table)
However, our implementation of these functions wasn't safe. We had
code looking like:
table = ...
yield table
table.delete()
The thinking was that the cleanup part (the table.delete()) will be
called after the user's code. However, if the user's code threw
(i.e., a failed assertion), the cleanup wasn't called... When the user's
code throws, it looks as if the "yield" throws. So the correct code
should look like:
table = ...
try:
yield table
finally:
table.delete()
Python's contextmanager documentation indeed gives this idiom in its
example.
This patch fixes all contextmanager implementations in our tests to do
the cleanup even if the user's "with" block throws.
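The correct idiom is easy to verify in isolation; `create_test_table` below stands in for any of the fixed helpers, and the `events` list just records what ran:

```python
from contextlib import contextmanager

events = []

@contextmanager
def create_test_table(name):
    # Stand-in for the real resource-creating helpers in the test suites.
    events.append("create " + name)
    try:
        yield name
    finally:
        # Runs even when the body of the `with` block raises,
        # e.g. on a failed assertion.
        events.append("delete " + name)

try:
    with create_test_table("t1") as table:
        assert False, "simulated test failure"
except AssertionError:
    pass
```

Despite the failure inside the with block, the delete still runs; with the broken yield-then-cleanup version, events would stop after the create.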
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210428083748.552203-1-nyh@scylladb.com>
On Debian variants, mdmonitor.service cannot be enabled because it is
missing an [Install] section, so 'systemctl enable mdmonitor.service' will
fail and mdmonitor will not run after the system is restarted.
To force the service to run, add Wants=mdmonitor.service to
var-lib-scylla.mount.
Fixes #8494
Closes #8530
The range passed to create_sharding_metadata is supposed to be owned or
at least partially owned by the shard. Log keys, range and split
ranges for debugging if the range does not belong to the shard.
This is helpful for debugging "Failed to generate sharding
metadata for foo.db" issues reported.
Refs #7056
Closes #8557
In an ongoing effort to drop the `restrictions` class hierarchy, rewrite the partition-range calculation code to use the new `expression` objects.
Refs: #7217, #3815
Tests: unit (dev, debug)
Closes #8525
* github.com:scylladb/scylla:
cql3: Specialize partition-range computation for EQ
cql3: Replace some bounds_ranges calls
cql3: Get partition range from expr::expression
cql3: Track partition-range expressions
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch, we continue the work to switch bootstrap operation to use
NODE_OPS_CMD.
The benefits:
- Detecting pending node operations, to avoid multiple topology changes
at the same time, is more reliable than using gossip status.
- The cluster reverts to a state before the bootstrap operation
automatically in case of error much faster than gossip.
- Allows users to pass a list of dead nodes to ignore during bootstrap
explicitly.
- The BOOTSTRAP gossip status is not needed any more. This is one step
closer to achieve gossip-less topology change.
Fixes#8472
Save a couple of allocations per request by treating all-EQ cases
specially during the computation of partition ranges.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
removenode_with_repair() runs on all the nodes that need to sync data
from other nodes, so offstrategy compaction can be easily wired by
notifying tables when removenode completes.
From now on, when a user runs removenode, new sstables produced in receiving
nodes will be added to the table's maintenance set, and when the operation
completes, offstrategy compaction will be started to reshape those new
sstables before integrating them into the main set.
Refs #5226.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Adds an abort_source for repair_tracker which is triggered when all
repairs are asked to be stopped. It is currently used by the "wait for
all hints to be sent" operation - now, it aborts when repairs are
requested to be aborted.
Adds an abort_source to repair_tracker. The abort source is triggered
when the repair subsystem is shut down. Its purpose is to allow
operations, such as waiting for hints to be sent, to be able to abort
themselves.
When a hint queue becomes stuck due to not being able to send to its
destination (e.g. destination node is no longer UP, or we failed to send
some hints from a file), then it's better to immediately dismiss anybody
who waits for hint replay instead of letting them wait until timeout.
Adds the `get_hosts_participating_in_repair` function which returns a
list of hosts participating in repair.
This list will be used by repair coordinator to tell other nodes to wait
until they replay their hints towards the nodes from the list.
Adds a `wait_for_hints_to_be_replayed` function which waits until all
hints between specified endpoints are replayed.
For each node, a hint sync point is created. Then, the repair coordinator
waits until the hint sync point is reached on every node, or a timeout
occurs. This is done by querying each host participating in repair every
second to check whether the sync point is still there.
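The wait loop can be sketched as follows; `check_sync_point(host)` stands in for the HINT_SYNC_POINT_CHECK verb and returns True while the sync point still exists on that host (all names here are illustrative):

```python
import time

def wait_for_hint_sync_points(hosts, check_sync_point, timeout, poll_interval=1.0):
    """Poll every host until its hint sync point disappears (meaning its
    hints were replayed) or until the timeout expires."""
    deadline = time.monotonic() + timeout
    pending = set(hosts)
    while pending:
        # Drop hosts whose sync point is gone, i.e. replay finished there.
        pending = {h for h in pending if check_sync_point(h)}
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError("hosts still replaying hints: %s" % sorted(pending))
        time.sleep(poll_interval)
```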
Adds the `wait_for_hint_replay_before_repair` configuration option. If
set to true, the repair coordinator will first wait until the cluster
replays its hints towards the nodes participating in repair. It is set
to true by default, and is live-updateable. It will be used in
subsequent commits from the same PR.
Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK.
Those will make it possible to create a sync point and regularly poll
to check its existence.
Adds two methods to `storage_proxy`:
- `create_hint_queue_sync_point` - creates a "hint sync point" which
is kept present in storage_proxy until all hint queues on the local
node reach their current end. It will also disappear if the given
deadline is reached first.
- `check_hint_queue_sync_point` - checks if the given hint sync point
still exists.
The created sync point waits for hint queues in all hint managers, on
all shards.
The r5b instances also have ENA support. For confirmation
that all r5b instances have ENA, go to the following page:
https://instances.vantage.sh/
Select r5b and add the 'enhanced networking' column. It
will then show that every r5b type has ENA support.
Closes#8546
The original series which forward-ported the service levels into open-source omitted important fixes to their infrastructure. The fixes are hereby ported.
Tests: unit(release)
Closes #8540
* github.com:scylladb/scylla:
workload prioritization: Fix configuration change detection
workload prioritization: add exception protection in configuration polling
The configuration change detection is based on a loop that
advances two iterators and compares the two collections
to deduce the configuration change. In order to
correctly deduce the changes, the iteration has to follow
the key (service level name) order for both
of the collections. If it doesn't, the results are
undefined and in some cases can lead to a crash of the
system. The bug is that the _service_level_db field was
implemented using an unordered_map, which obviously doesn't
guarantee the ordering that the change detection assumes.
The fix is simply to change the field type to a map
instead of an unordered_map.
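The ordering assumption can be illustrated with a classic sorted two-pointer diff; it only produces correct results because both key sequences are visited in the same (sorted) order, which an unordered map does not provide. The function below is a sketch, not Scylla's actual code:

```python
def diff_service_levels(old, new):
    """Diff two {service_level_name: config} mappings by walking their
    sorted keys in lockstep. Returns (added, removed, changed)."""
    old_keys, new_keys = sorted(old), sorted(new)
    i = j = 0
    added, removed, changed = [], [], []
    while i < len(old_keys) and j < len(new_keys):
        ok, nk = old_keys[i], new_keys[j]
        if ok == nk:
            if old[ok] != new[nk]:
                changed.append(ok)
            i += 1
            j += 1
        elif ok < nk:          # key vanished from the new configuration
            removed.append(ok)
            i += 1
        else:                  # key appeared in the new configuration
            added.append(nk)
            j += 1
    removed.extend(old_keys[i:])
    added.extend(new_keys[j:])
    return added, removed, changed
```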
Another problem is that when a static service level (i.e.
default) is at the end of the keys list, it is repeatedly
being deleted - which doesn't really do anything, since deleting
a static service level just retains its default values,
but it is still wrong.
Exceptions around the polling loop were not handled properly.
This is an issue because if an unhandled exception
slips out to the configuration polling loop itself, it will break
it. When the configuration polling loop is broken, any further
change to the configuration will not be acted upon on the nodes
where the loop is broken until the node is restarted. The chances
for exceptions are now greater than before, since in one of the
previous commits we started querying the workload prioritization
configuration table with a sensible, shorter timeout.
This change also adds a logger for the workload prioritization
module and some logging, mainly around the configuration polling loop.
Most logs are added at the info level since they are not expected to
happen frequently, but when they do we would like to have some
information by default about what broke the loop.
Issues #4476 and #8489, and also Cassandra's CASSANDRA-10715, all request
that filtering with "WHERE v=NULL" should return the rows where the column
v is unset. However, we made a deliberate decision to do something else:
That "WHERE v=NULL" should match no row. Exactly like it does in SQL.
This is what this test verifies - that "WHERE v=NULL" never matches any
row - not even rows where "v" is unset.
This test is expected to fail on Cassandra (so marked cassandra_bug),
because in Cassandra the "WHERE v=NULL" restriction is forbidden,
instead of succeeding and returning nothing.
Although we differ here from Cassandra, after a lot of deliberation we
decided that Scylla's behavior is the correct one, so this test verifies
it.
Refs #4476.
Refs #8489.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210426183145.323301-1-nyh@scylladb.com>
Seastar seems to have added another layer of indirection to
alloc_site_list_head/backtrace, so scylla_heapprof() can't find the
members it's looking for, resulting in errors. Fix it by
dereferencing the added layer.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8551
When probes are sent over a slow network, the leader would send
multiple probes to a lagging follower before it would get a
reject response to the first probe back. After getting a reject, the
leader will be able to correctly position `next_idx` for that
follower and switch to pipeline mode. Then, an out of order reject
to a now irrelevant probe could crash the leader, since it would
effectively request it to "rewind" its `match_idx` for that
follower, and the code asserts this never happens.
We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.
The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives. We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
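A sketch of the strengthened check, with the field names taken from the commit message and everything else illustrative (including the None handling for a follower whose log was simply too short). Since match_idx only ever grows and reflects confirmed replication, any reject that would rewind it must come from an outdated probe:

```python
from collections import namedtuple

# non_matching_idx is the index the follower rejected; it is modelled as
# None here when the follower's log did not reach the probed index at all.
Reject = namedtuple("Reject", ["non_matching_idx", "last_idx"])

def is_stray_reject(rejected, match_idx):
    """Return True if this reject refers to an already-superseded probe
    and must be ignored instead of being used to rewind next_idx/match_idx.
    May return false negatives, as the commit message explains."""
    if rejected.non_matching_idx is not None and rejected.non_matching_idx <= match_idx:
        return True
    if rejected.last_idx < match_idx:
        return True
    return False
```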
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
RPC configuration was updated only when an instance was
started with an initial snapshot.
In case we don't have an initial snapshot, but do have
a non-empty log with a configuration entry, the RPC
instance isn't set up correctly.
Fix that by moving RPC setup code outside the check for
snapshot id and look at `_log.get_configuration()` instead.
Also, set up RPC mappings both for `current` and `previous`
components, since in case the last configuration index
points to an entry from the log, it can happen to be
a joint configuration entry.
For example, this can happen if a leader made an attempt
to change configuration, but failed shortly afterwards
without being able to commit the new configuration.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com>
Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>
This series is a conceptual revert of 4c8ab10, which turned out to be a
misguided defense mechanism that proved to be a hotbed for bugs. This
protection was superseded by 0fe75571d9 which guarantees forward
progress at all times without all the gotchas and bad interactions
introduced by 4c8ab10.
The latest instance of bad interaction that triggered this series is a
case of resource units being leaked when a previously evicted reader is
re-admitted, leaking already owned resources on each re-admission.
To prove that neither the resource leak, nor the deadlock 4c8ab10 was
supposed to guard against exists after this series, it includes two unit
tests stressing the respective areas: readmission and admission on a
highly contested semaphore.
Fixes: #8493
Also on: https://github.com/denesb/scylla.git
reader-permit-resource-leak-v2
Changelog
v2:
* Rebase over the recently merged reader close series. Fix merge
conflicts and an exposed bug.
* 'reader-permit-resource-leak-v2' of https://github.com/denesb/scylla:
test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
reader_concurrency_semaphore: add dump_diagnostics()
reader_permit: always forward resources
reader_concurrency_semaphore: inactive_read_handle: abandon(): close reader
This series adds filling the `username` column in alternator tracing info, if the username is available. When alternator is enforcing authorization, each request contains a username in its headers.
The difference is as follows. A tracing entry excerpt before the series:
```
{
(...)
'source_ip': '::',
'table_names': 'alternator_Pets.Pets',
'username': '<unauthenticated request>'
}
```
and after the series:
```
{
(...)
'source_ip': '::',
'table_names': 'alternator_Pets.Pets',
'username': 'alternator'
}
```
This series also modifies one of the tests to check the username column.
Fixes #8547
Closes #8548
* github.com:scylladb/scylla:
test: add username verification to alternator tracing tests
alternator: add user context to tracing
alternator: return username when verifying signature
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kinds of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.
The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
readmitting a previously admitted reader doesn't leak any units.
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
block its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
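The forward-progress rule from 0fe75571d9 that makes the old protection redundant can be sketched as follows (the resource shape and names are illustrative):

```python
from collections import namedtuple

Resources = namedtuple("Resources", ["count", "memory"])

def can_admit(available, initial_count, requested):
    """Admit when both count and memory are available. Additionally, admit
    whenever all initial count units are free -- meaning no admitted read
    is currently alive -- regardless of memory. This guarantees forward
    progress even if memory-only permits hold all the memory."""
    if available.count >= requested.count and available.memory >= requested.memory:
        return True
    return available.count == initial_count
```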
The benefit of this revert is that the semaphore now knows about all
the resources and is able to do its job better, as it is not "lied to"
about resources by the permits. Furthermore, the status of a permit's
resources is much simpler to reason about; there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
fa43d7680 recently introduced mandatory closing of readers before they
are destroyed. One reader destroy path that was left not closing the
reader before destruction is `inactive_reader_handle::abandon()`. This
path is executed when the handle is destroyed while still referring to a
non-evicted inactive read. This patch fixes it up to close the reader
and adds a small unit test which checks that this happens.
Implements `wait_until_hints_are_replayed` method returning a future
which blocks until either all current hint segments are replayed
(returns success in this case), or when provided timeout is reached
(returns a timeout exception in this case).
Before this patch, each entry in alternator tracing included
an "<unauthenticated request>" field. It's not really true,
because most of alternator requests are actually performed
by authenticated users (unless auth is disabled).
This patch introduces the most basic bare infrastructure to generate
coverage report as well as a guide on how to manually generate them.
Although this barely qualifies as "support", it already allows one to
generate a coverage report with the help of this guide.
One immediate limitation of this patch is that it only supports clang,
which is not a terrible problem, given that it's currently our main
compiler.
Future patches will build on this to incrementally improve and
automate this:
* Provide script to automatically merge profraw files and generate html
report from it.
* Integrate into test.py, adding a flag which causes it to generate
a coverage report after a run.
* Support GCC too, or at least auto-detect whether clang is used or
not.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210423140100.659452-1-bdenes@scylladb.com>
"
This patchset adds future-returning close methods to all
flat_mutation_reader-s and makes sure that all readers
are explicitly closed and waited for.
The main motivation for doing so is for providing a path
for cancelling outstanding i/o requests via the input_stream
close (See https://github.com/scylladb/seastar/issues/859)
and wait until they complete.
This series also introduces a stop
method to reader_concurrency_semaphore to be used when
shutting down the database, instead of calling
clear_inactive_readers in the database destructor.
The series does not change microbenchmarks performance in a significant way.
It looks like the results are within the tests' jitter.
- perf_simple_query: (in transactions per second, more is better)
before: median 184701.83 tps (90 allocs/op, 20 tasks/op)
after: median 188970.69 tps (90 allocs/op, 20 tasks/op) (+2.3%)
- perf_mutation_readers: (in time per iteration, less is better)
combined.one_row 65.042ns -> 57.961ns (-10.9%)
combined.single_active 46.634us -> 46.216us ( -0.9%)
combined.many_overlapping 364.752us -> 371.507us ( +1.9%)
combined.disjoint_interleaved 43.634us -> 43.448us ( -0.4%)
combined.disjoint_ranges 43.011us -> 42.991us ( -0.0%)
combined.overlapping_partitions_disjoint_rows 57.609us -> 58.820us ( +2.1%)
clustering_combined.ranges_generic 93.464ns -> 96.236ns ( +3.0%)
clustering_combined.ranges_specialized 86.537ns -> 87.645ns ( +1.3%)
memtable.one_partition_one_row 903.546ns -> 957.639ns ( +6.0%)
memtable.one_partition_many_rows 6.474us -> 6.444us ( -0.5%)
memtable.one_large_partition 905.593us -> 878.271us ( -3.0%)
memtable.many_partitions_one_row 13.815us -> 14.718us ( +6.5%)
memtable.many_partitions_many_rows 161.250us -> 158.590us ( -1.6%)
memtable.many_large_partitions 24.237ms -> 23.348ms ( -3.7%)
average -0.02%
Fixes #1076
Refs #2927
Test: unit(release, debug)
Perf: perf_mutation_readers, perf_simple_query (release)
Dtest: next-gating(release),
materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_max_to_half_test repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test(debug)
"
* tag 'flat_mutation_reader-close-v7' of github.com:bhalevy/scylla: (94 commits)
mutation_reader: shard_reader: get rid of stop
mutation_reader: multishard_combining_reader: get rid of destructor
flat_mutation_reader: abort if not closed before destroyed
flat_mutation_reader: require close
repair: row_level_repair: run: close repair_meta when done
repair: repair_reader: close underlying reader on_end_of_stream
perf: everywhere: close flat_mutation_reader when done
test: everywhere: close flat_mutation_reader when done
mutation_partition: counter_write_query: close reader when done
index: built_indexes_reader: implement close
mutation_writer: multishard_writer: close readers when done
mutation_writer: feed_writer: close reader when done
table: for_all_partitions_slow: close iteration_step reader when done
view_builder: stop: close all build_step readers
stream_transfer_task: execute: close send_info reader when done
view_update_generator: start: close staging_sstable_reader when done
view: build_progress_virtual_reader: implement close method
view: generate_view_updates: close builder readers when done
view_builder: initialize_reader_at_current_token: close reader before reassigning it
view_builder: do_build_step: close build_step reader when done
...
"
There are few places left that call for migration manager
by global reference. This set patches all those places
and makes the migration manager a service that locally
lives in main(). Surprisingly, the largest changes are to
get rid of global migration manager calls from ... the
migration manager itself.
Two tricks here. First, repair code gets its private global
migration manager pointer. That's not nice, but it aligned
with current repair design -- all its references are now
"global". Some day they all will be moved into sharded
repair service, for now these globals just describe the real
dependencies of the repair code.
Second is storage proxy that needs to call migration manager
to get schema. Proper layering makes migration manager sit
on top of storage proxy, so the direct back-reference is
not nice. To overcome this the proxy gets migration manager's
shared_from_this() pointer and drops all of them on stop.
This makes sure that by the time migration manager stops
no references from proxy exist.
tests: unit(dev), start-stop, start-drain-stop
"
* 'br-turn-migration-manager-local' of https://github.com/xemul/scylla: (21 commits)
migration_manager: Make it main-local
tests: Have own migration manager instances
tests: Use migration_manager from cql_test_env
migration_manager: Call maybe_sync from this
migration_manager: Make get_schema_for_... methods
migration_manager: Hide get_schema_definition
streaming: Keep migration_manager ptr in rpc lambdas
storage_proxy: Keep migration_manager ptr in rpc lambdas
streaming: Get migration_manager shared_ptr in messaging
storage_proxy: Get migration_manager shared_ptr in messaging
migration_manager: Make maybe_sync a method
migration_manager: Open-code merge lambda
migration_manager: Turn do_announce_new_type non-static
migration_manager: Make announce() non-static method
storage_servive: Use local migration manager
storage_service: Keep migration manager on board
migration_manager: Use 'this' where appropriate
repair: Use private migration manager pointer
repair: Keep private sharded migration manager pointer
redis: Carry sharded migration manager over init
...
If utils/rjson.hh is modified, 300 (!) source files get recompiled.
This is frustrating for anyone working on this header file (like me).
Moreover - utils/rjson.hh includes the large rapidjson header
files (rapidjson is a header-only library!), slowing the compilation
of all these 300 files.
It turns out most includers of utils/rjson.hh get it because
column_computation.hh includes it. But the fact that column
computations are serialized as JSON is an internal implementation
detail that the users of this header don't need to know - and they
care even less that this JSON implementation uses utils/rjson.hh.
So in this patch column_computation.hh no longer includes rjson.hh,
and no longer exposes a method taking a rjson::value that was never
used outside the implementation.
After this patch, touching utils/rjson.hh only recompiles 21 files.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210422183526.114366-1-nyh@scylladb.com>
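The decoupling described above can be sketched as follows. This is a hypothetical, simplified interface (not the actual column_computation.hh): the point is that a forward declaration replaces the heavy include, so includers of this header no longer pull in rapidjson transitively.

```cpp
#include <cassert>
#include <string>

// Hypothetical simplified sketch, not the real header: forward-declare
// the rjson type instead of including utils/rjson.hh, so this header no
// longer drags the rapidjson headers into its ~300 includers.
namespace rjson { class value; }  // forward declaration only, no #include

class column_computation {
public:
    virtual ~column_computation() = default;
    // The public interface exchanges opaque serialized bytes; only the
    // .cc file needs the full rjson.hh to produce the JSON.
    virtual std::string serialize() const = 0;
};

// Trivial implementation for demonstration purposes.
struct trivial_computation : column_computation {
    std::string serialize() const override { return "{}"; }
};
```

Only the implementation file that defines serialization needs the real JSON headers; everyone else compiles against the forward declaration.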
Now that the multishard_combining_reader is guaranteed to be closed,
there is no longer a need to stop the shard readers
in the destructor.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The motivation to abort if the reader is not closed before it's destroyed
is mainly driven by:
1. Aborting will force us to find and fix missing closes.
Otherwise, log warnings can easily be lost in the noise.
(ERRORs however are caught by dtests but won't be necessarily
caught in SCT / production environments)
2. Following patches remove existing cleanup code
in destructors that is not needed once close() is mandated.
If we don't abort on missing close we'll have to keep maintaining
both cleanup paths forever.
3. Not enforcing close exposes us to leaks and potential
use-after-free from background tasks that are left behind.
We want to stop guaranteeing the safety of the background tasks
post close().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
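The abort-on-missing-close policy can be illustrated with a minimal sketch (hypothetical class, not the actual flat_mutation_reader): the destructor aborts when close() was never called, so the leak is caught immediately instead of being lost as a log warning.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Minimal sketch of the policy: a resource that must be explicitly
// close()d before destruction. The destructor aborts otherwise, so a
// missing close fails loudly in testing rather than leaking background
// work that can cause use-after-free later.
class closeable_resource {
    bool _closed = false;
public:
    void close() noexcept { _closed = true; }
    ~closeable_resource() {
        if (!_closed) {
            std::fprintf(stderr, "closeable_resource destroyed without close()\n");
            std::abort();  // enforce the close-before-destroy contract
        }
    }
};

// Correct usage: close before the object goes out of scope.
bool use_and_close() {
    closeable_resource r;
    r.close();
    return true;
}
```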
Make flat_mutation_reader::impl::close pure virtual
so that all implementations are required to implement it.
With that, provide a trivial implementation to
all implementations that currently use the default,
trivial close implementation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Always close repair_meta (that closes its reader).
Proper closing is done via the repair_meta::stop path.
Ignore any errors when auto-closing in a deferred action
since there is nothing else we can do at this point.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to close the builder's _updates and optional _existings
readers before they are destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If reader.fill_buffer() fails, we will not call `maybe_pause` and the
reader will be destroyed, so make sure to close it.
Otherwise, the reader is std::move'ed to `maybe_pause` that either
paused using register_inactive_read or further std::move'ed to _reader,
in both cases it doesn't need to be closed. `with_closeable`
can safely try to close the moved-from reader and do nothing in this
case, as the f_m_r::impl was already moved away.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Close the _closing_gate to wait on background
close of dropped queries, and close all remaining queriers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to close the dropped querier before it's destroyed.
The operation is moved to the background so as not to penalize
the common path.
A following patch will add a querier_cache::close() method
that will close _closing_gate to wait on the querier close
(among other things it needs to wait on :)).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation to closing the querier in the background
before dropping it.
With that, stats need not be passed as a parameter,
but rather the _stats member can be used directly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to close the querier and subsequently its reader before
destroying it (unless it was moved).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Depending on the variant _reader contents, either close
the reader or unregister the inactive reader and close it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of calling a lambda function for each index
simply iterate over all indices and use co_await / co_return
in the inner loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to close the unregistered inactive_read
before it's destroyed, if the unregistered reader_opt
is engaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"forward" the unregister to the other semaphore
in case on_internal_error throws rather than aborting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Close readers in the background:
- evicted based on ttl, or
- those that weren't admitted by register_inactive_read
- those that are destroyed in clear_inactive_reads.
Use a gate for waiting on these background closes in stop().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
enqueue_waiter before evicting readers and start a loop in the
background to dequeue and close inactive_readers until
either the _wait_list is empty or there are no more inactive_readers
to evict.
We admit the read synchronously only if the wait_list is empty
and the semaphore has_available_units to satisfy admission.
We need to enqueue the reader before starting to evict readers
to make sure any evicted resources are assigned to the
waiter at the head of the queue and not "stolen"
in case we yield and some other caller grabs them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In addition to clear_inactive_reads, that's currently called when
the database object is destroyed, introduce a stop() method that will:
1. wait on all background closes of inactive_reads.
2. close all present inactive_reads and wait on their close.
3. signal waiters on the wait_list via broken() with a proper
exception indicating that the semaphore was closed.
In addition, assert in the semaphore's destructor
that it has no remaining inactive reads.
Stop must be called from whoever owns the r_c_s.
Mainly, from database::stop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than explicitly generating it by all callers
and then not using the argument at all.
Prepare for providing a different exception_ptr
from a stop() path to be introduced in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently unregister_inactive_read for other shards is moved
to the background with nothing keeping the respective
reader_concurrency_semaphore around.
This change runs the loop in parallel_for_each
so that we don't have to serially wait on all of them
but rather they can run in parallel on all shards, but
all are waited on via the returned future<>.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently only if the reader_meta is in the saved state
we unregister its inactive_read, yet it is possible
that it will hold an inactive_read also in the lookup state.
To cover all cases, rather than testing the reader_state,
unregister if the inactive_read_handle is engaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As it can't throw. This is needed to simplify the following
patch that will always close the reader in read_page.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
return reader lifecycle policy's destroy_reader future
so it can be waited on by caller (multishard_combining_reader::close).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the logic in ~foreign_reader to close()
to wait on the read_ahead future and close the underlying
reader on the remote shard. Still call close in the background
in ~foreign_reader if destroyed without closing to keep the current
behavior, but warn about it, until it's proved to be unneeded.
Also, added on_internal_error in close if _read_ahead_future
is engaged but _reader is not, since this must never happen
and we wait on the _read_ahead_future without the _reader.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
These will be called by merging_reader::close via
mutation_fragment_merger::close in the following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Such that the holder, which is responsible for closing the
read_context before destroying it, holds it uniquely.
cache_flat_mutation_reader may be constructed either
with a read_context&, where it knows that the read_context
is owned externally, by the caller, or it could
be constructed with a std::unique_ptr<read_context> in
which case it assumes ownership of the read_context
and it is now responsible for closing it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
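The dual-ownership scheme above can be sketched like this (illustrative names, not the real cache_flat_mutation_reader API): the context is held either by raw pointer (caller-owned) or by unique_ptr (reader-owned), and close() only closes a context the reader actually owns.

```cpp
#include <cassert>
#include <memory>
#include <variant>

// Sketch of the dual-ownership idea: externally-owned contexts are
// borrowed by pointer; owned contexts are held uniquely and must be
// closed by the holder.
struct read_context {
    bool closed = false;
    void close() noexcept { closed = true; }
};

class reader {
    std::variant<read_context*, std::unique_ptr<read_context>> _ctx;
public:
    explicit reader(read_context& ctx) : _ctx(&ctx) {}  // externally owned
    explicit reader(std::unique_ptr<read_context> ctx) : _ctx(std::move(ctx)) {}  // owned
    void close() noexcept {
        if (auto* owned = std::get_if<std::unique_ptr<read_context>>(&_ctx)) {
            if (*owned) {
                (*owned)->close();  // we own it, so we must close it
            }
        }
        // A borrowed context is the caller's responsibility to close.
    }
};

bool borrowed_stays_open() {
    read_context ctx;
    reader r(ctx);
    r.close();
    return !ctx.closed;  // the caller still owns and closes ctx
}

bool owned_is_closed() {
    auto ctx = std::make_unique<read_context>();
    read_context* raw = ctx.get();
    reader r(std::move(ctx));
    r.close();
    return raw->closed;  // valid: r still keeps the context alive here
}
```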
Note that scanning_and_populating_reader::read_next_partition
now closes the current reader unconditionally
and before assigning a new reader. This should be an improvement
since we want to release the reader's resources as early
as possible, certainly before allocating new resources.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
use the newly introduced reassign method to first
close the flat_mutation_reader_opt before assigning it with
a new reader.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Close _delegate if it's engaged both in the close() method
and whenever it is reset by _delegate = {}.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't own _source, therefore we do not close it.
That said, we still need to make sure that the reversing reader
itself is closed to calm down the check when it's destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The underlying reader is owned by the caller if it is moved to it,
but not if it was constructed with a reference to the underlying reader.
Close the underlying reader on close() only in the former case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Close both the _index_reader and _context, if they are engaged.
Warn and ignore any errors from close as it may be called either
from the destructor or from f_m_r close.
Call close() for closing in the background if needed when destroyed
and warn about it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We'd like that to simplify the soon-to-be-introduced
sstable_mutation_reader::close error handling path.
close_index_list can be marked noexcept since parallel_for_each is,
with that index_reader::close can be marked noexcept too.
Note that since reader close cannot fail, both the lower and
upper bounds are closed (closing lower_bound cannot fail, so
closing upper_bound is always reached).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`with_closeable` simplifies scoped use of
flat_mutation_reader, making sure to always close
the reader after use.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
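A hedged sketch of the idea behind `with_closeable` (illustrative signature, not the exact Scylla helper): run a function with a reader and guarantee close() afterwards, whether the function returns or throws.

```cpp
#include <cassert>

// Run func with the reader, closing the reader on both the success and
// the error path, so callers can't forget the mandatory close.
template <typename Reader, typename Func>
auto with_closeable(Reader reader, Func func) {
    try {
        auto result = func(reader);
        reader.close();
        return result;
    } catch (...) {
        reader.close();  // close on the error path too
        throw;
    }
}

// Toy reader that counts how many times it was closed.
struct counting_reader {
    int* closes;
    void close() noexcept { ++*closes; }
};

int demo() {
    int closes = 0;
    int v = with_closeable(counting_reader{&closes},
                           [](counting_reader&) { return 42; });
    return closes == 1 ? v : -1;
}
```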
We cannot close in the background since there are use cases
that require the impl to be destroyed synchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow closing readers before destroying them.
This way, outstanding background operations
such as read-aheads can be gently canceled
and be waited upon.
Note that similar to destructors, close must not fail.
There is nothing to do about errors after the f_m_r is done.
Enforce that in flat_mutation_reader::close() so if the f_m_r
implementation did return a failure, report it and abort as internal
error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We will remove bounds_ranges when we kill the restrictions class
hierarchy. Of the several call sites, two can be easily modified to
avoid it. Others are more complicated and will be modified in a
subsequent commit.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Now everybody is patched to use component-local instance of migration
manager and its global instance can be moved into main() scope.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No more global migration manager usage left, so all the tests
can be patched to use local migration manager instance. In fact,
it's only the cql_test_env that's such.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the tests that need migration manager are run inside
cql_test_env context and can use the migration manager
from the env. For now this is still the global one, but
next patch will change this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only caller of maybe_sync() method is now the method itself
and can stop using global migration manager instance and switch
to using 'this'.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These two helpers are now namespace-scoped methods, but both
need the migration manager instance inside. All their callers
are now patched to have the migration manager at hands, so
the helpers can be turned into methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is exclusively used inside migration manager code,
so (for now) no use in keeping it exposed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch is the bridge between the previous one and the
next one and would be quite messy to merge with either.
No heavy changes -- just copy the migration manager's ptr
onto rpc lambdas. Will be used in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy's messaging code uses migration manager to obtain schema.
Since proxy is more low-level service than migration manager, it's
incorrect to make proxy reference the manager directly. Instead,
push the shared_ptr into proxy's messaging code. This kills two
birds with one stone:
1: let proxy use migration manager
2: makes sure that by the time migration manager is stopped
the proxy's use of this pointer is gone (unregistered from
rpc)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the maybe_sync is namespace-scope function. Turn it into
a migration_manager method so that it can use 'this' instead of
get_local_migration_manager().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This lambda uses global migration manager instance. Open-coding
this short lambda makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the only place that calls recently patched .announce()
method, so instead of grabbing global migration manager,
use 'this'.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method needs to get migration manager instance to call
methods on it, so turn it non-static to have the instance in
'this'. Caller (yes, only one) gets local migration manager
itself, but will be patched soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service needs migration manager to sync schema
on lifecycle notifiers and to stop it on drain. So this
patch just pushes the migration manager reference all the
way through the storage service constructor.
A few words about tests. Since the storage service now needs the
migration manager in constructor, some tests should take it
from somewhere. The cql_test_env already has (and uses) it,
all the others can just provide a not-started sharded one,
it won't be in use in _those_ tests anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some of its non-static methods call get_local_migration_manager
instead of using 'this'. None of these places use it to
get a cross-shard instance, so it's safe to use 'this' there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's nowadays standard for repair to keep global pointers on
the needed services. Keep the migration manager there too to
avoid an explicit call to get_local_migration_manager. Later this
pile of global pointers will be encapsulated in the repair service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only place in redis that needs migration manager is the
::init method that's called on start. It's possible to pass
the migration manager as an argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The migration_task is the class with the single static method
that's called from a single place in migration manager and
this method calls migration manager right back. There's
not much sense in keeping this abstraction, so merge it into the
migration manager.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Adds a field to `end_point_hints_manager::sender`:
`_total_replayed_segments_count` which keeps track of how many segments
were replayed so far. This metric will be used to calculate the
sequence number of the last current hint segments in the queue - so
that we can implement waiting for current segments to be replayed.
Add a statement_restrictions member that tracks expressions that
together define the partition range.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Endpoint hints manager keeps a list of segments to replay. New segments
are appended to it lazily - only when a hint flush occurs (hints
commitlog instance is re-created) and the list is empty. Because of
that, this list cannot be currently used to tell how many segments are
on disk.
This commit makes it possible to trigger a hints flush and forcefully
update the list of segments to replay. In later commits, a mechanism
will be implemented which will allow waiting until a given number of
hint segments is replayed. Triggering a hints flush with a segment
list update will allow
us to properly synchronize and determine up to which segment we need to
wait.
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, the malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson::error` if it fails to allocate or reallocate memory.
This series comes with unit tests which check the new allocator behavior and also validate that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in an invalid state after throwing. The last part is verified by the fact that its destructor runs without errors.
Fixes #8521
Refs #8515
Tests:
* unit(release)
* YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size. Nothing crashed during various YCSB workloads, but nothing crashed for me locally before this patch either, so it's not a 100% robust verification
relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```
Closes #8529
* github.com:scylladb/scylla:
test: add a test for rjson allocation
test: rename alternator_base64_test to alternator_unit_test
rjson: add a throwing allocator
Migration manager has a function to get a schema (for read or write);
this function queries a peer node and retrieves the schema from it. One
scenario where this can happen is when an old node queries an old,
unfixed index.
This makes a hole through which views that are only adjusted for reading
can slip.
Here we plug the hole by fixing such views before they are registered.
Closes #8509
Otherwise waiters on committed configuration changes (e.g.
`server::set_configuration`) would never get notified.
Also if we tried to send another entry concurrently we would get
replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed.
(not sure if this commit also fixes whatever caused that).
Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>
sstable cells are parsed into temporary_buffers, which causes large contiguous allocations for some cells.
This is fixed by storing fragments of the cell value in a fragmented_temporary_buffer instead.
To achieve this, this patch also adds new methods to fragmented_temporary_buffer (size(), an ostream operator<<()) and adds methods to the underlying parser (primitive_consumer) for parsing byte strings into fragmented buffers.
Fixes #7457
Fixes #6376
Closes #8182
* github.com:scylladb/scylla:
primitive_consumer: keep fragments of parsed buffer in a small_vector
sstables: add parsing of cell values into fragmented buffers
sstables: add non-contiguous parsing of byte strings to the primitive_consumer
utils: add ostream operator<<() for fragmented_temporary_buffer::view
compound_type: extend serialize_value for all FragmentedView types
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
Tests: manual
Fixes #8468
Closes #8470
* github.com:scylladb/scylla:
qos: make sure to wait for sl updates on shutdown
db: stop using infinite timeout for service level updates
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.
Refs: #7341
Tests: dtest (next-gating)
Closes #8526
"
Memory usage is considerably reduced by making reshape switch to partitioned set,
given that input sstables are disjoint. This will benefit reshape for all
strategies, not only TWCS.
Write amplification is reduced a lot by compacting all input sstables at once,
which is possible given that unbounded memory usage is fixed too.
With both these issues fixed, TWCS reshape will be much more efficient.
tests: mode(dev).
"
* 'twcs_reshape_fixes' of github.com:raphaelsc/scylla:
tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently
TWCS: Reshape all sstables in a time window at once if they're disjoint
sstables: Extract code to count amount of overlapping into a function
LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
compaction: Make reshape compaction always use partitioned_sstable_set
compaction: Allow a compaction type to override the sstable_set for input sstables
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
In order to make the shutdown order more standardized,
the operation is split into two phases - draining and stopping.
Tests: manual
Fixes #8468
Due to a porting bug, the routines for updating service levels
used the default infinite timeout for internal CQL queries,
which causes Scylla to hang on shutdown. The behavior is now
fixed and the routines use the same timeout as the other
similar functions - 10s at the time of writing this message.
With repair-based operations, each window will have 256 disjoint
sstables due to data segregation which produces N sstables for each
vnode range, where N = # of existing windows. So each window ends up
with one sstable per vnode range = 256.
Given that reshape now unconditionally uses partitioned set's incremental
selector, all the 256 sstables can be compacted at once as compaction
essentially becomes a copy operation, where only one sstable will be
opened at a time, making its memory usage very efficient.
By compacting all sstables at once, write amplification is greatly
reduced because each byte is now rewritten only once.
Previously, with the initial set of 256 sstables, write amp could be
up to 8, which makes reshape for TWCS very slow.
Refs #8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This function will be reused by TWCS reshape when checking if all
sstables in a window are disjoint and can be all compacted together.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The wrong comparison operator was used when checking for overlap. It
would miss an overlap when the last key of an sstable is equal to the
first key of the sstable that comes next in the set, which is sorted by
first key.
Fixes#8531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
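The bug above can be shown with a simplified disjointness check (integer keys stand in for sstable partition keys; names are illustrative, not the real compaction code). The set is sorted by first key; two neighbours overlap when the previous last key is >= the next first key, and using '>' instead of '>=' misses exactly the equal-boundary-key case this fix addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of one sstable's key span.
struct sstable_range {
    int first;
    int last;
};

// Input must be sorted by first key. Returns true only when no two
// neighbouring sstables overlap, treating a shared boundary key as an
// overlap (the '>=' that the fix restores).
bool is_disjoint(const std::vector<sstable_range>& sorted_by_first) {
    for (std::size_t i = 1; i < sorted_by_first.size(); ++i) {
        if (sorted_by_first[i - 1].last >= sorted_by_first[i].first) {
            return false;  // overlap, including equal boundary keys
        }
    }
    return true;
}
```

With '>' instead of '>=', the pair {0,5},{5,9} would wrongly pass as disjoint even though both sstables contain key 5.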
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch, we continue the work to switch decommission operation to use
NODE_OPS_CMD.
The benefits:
- A UUID is used to identify each node operation across the cluster.
- It is more reliable to detect pending node operations, to avoid
multiple topology changes at the same time.
- The cluster reverts to a state before the decommission operation
automatically in case of error. Without this patch, the node to be
decommissioned will be stuck in decommission status forever until it
is restarted and goes back to normal status.
- Allows users to pass a list of dead nodes to ignore for decommission
explicitly.
- The LEAVING gossip status is not needed any more. This is one step
closer to achieve gossip-less topology change.
- Allows us to trigger off-strategy compaction easily on the node
receiving the ranges
Fixes#8471
The default rapidjson allocator returns nullptr from
a failed allocation or reallocation. It's not a bug by itself,
but rapidjson internals usually don't check for these return values
and happily use nullptr as a valid pointer, which leads to segmentation
faults and memory corruptions.
In order to prevent these bugs, the default allocator is wrapped
with a class which simply throws once it fails to allocate or reallocate
memory, thus preventing the use of nullptr in the code.
One exception is Malloc/Realloc with size 0, which is expected
to return nullptr by rapidjson code.
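A hedged sketch of the wrapper idea (this is not rapidjson's actual allocator interface, and the real code throws rjson::error rather than std::bad_alloc): delegate to malloc/realloc but throw instead of returning nullptr, so internals that skip the null check can never dereference one.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Illustrative throwing allocator: zero-size requests still return
// nullptr, as rapidjson expects; genuine allocation failures throw
// instead of handing nullptr to unchecked callers.
struct throwing_allocator {
    static void* Malloc(std::size_t size) {
        if (size == 0) {
            return nullptr;  // rapidjson expects nullptr for size 0
        }
        void* p = std::malloc(size);
        if (!p) {
            throw std::bad_alloc();
        }
        return p;
    }
    static void* Realloc(void* ptr, std::size_t new_size) {
        if (new_size == 0) {
            std::free(ptr);
            return nullptr;
        }
        void* p = std::realloc(ptr, new_size);
        if (!p) {
            throw std::bad_alloc();
        }
        return p;
    }
    static void Free(void* ptr) noexcept { std::free(ptr); }
};
```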
On 'product != scylla' environments, we have a bug with .default file
(sysconfig file) handling.
Since the .default file should be installed under its original name,
the package name may not match the .default filename.
(e.g.: the default file is /etc/default/scylla-node-exporter, but the
package name is scylla-enterprise-node-exporter)
When the filename doesn't match the package name, it should be renamed
as follows:
<package name>.<filename>.default
We already do this for .service files, but mistakenly haven't handled
.default files, so let's add it too.
Related scylladb/scylla-enterprise#1718
Fixes #8527
Closes #8528
Reduce rebuilds and build time by removing unnecessary includes. Along the way,
improve header sanity.
Ref #1.
Test: dev-headers, unit(dev).
Closes #8524
* github.com:scylladb/scylla:
treewide: remove inclusions of storage_proxy.hh from headers
storage_proxy: unnest coordinator_query_result
treewide: make headers self-sufficient
utils: intrusive_btree: add missing #pragma once
Make sure that sstable::unlink will never fail.
It will terminate in the unlikely case toc_filename
throws (e.g., on bad_alloc); otherwise it ignores any other error
and just warns about it.
Make unlink a coroutine to simplify the implementation
without introducing additional allocations.
Note that remove_by_toc_name and maybe_delete_large_data_entries
are executed asynchronously and concurrently.
Waiting for them to finish is serialized by co_await,
making sure that both are being waited on so not to leave
abandoned futures behind.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210420135020.102733-1-bhalevy@scylladb.com>
Reshape compaction potentially works with disjoint sstables, so it will
benefit a lot from using partitioned_sstable_set, which is able to
incrementally open the disjoint sstables. Without it, all sstables are
opened at once, which means unbounded memory usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).
Ref #1.
Nested classes cannot be forward declared, and
storage_proxy::coordinator_query_result is used in pagers, where
we'd like to forward-declare it. Unnest it and introduce an alias
for compatibility.
By default, compaction will pick an implementation of sstable_set as
defined by the underlying compaction strategy.
However, reshape compaction potentially works with disjoint sstables
and will benefit a lot from always using partitioned set.
For example, when reshaping a TWCS table, it's better to use the
partitioned set rather than the time window set, as the former will
be much more memory efficient by incrementally selecting sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.
Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!
So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.
The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).
Refs #5021
Fixes #8513
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
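A toy model of the fix (hypothetical helpers, not Alternator's actual comparison code): set equality must ignore element order - modelled here with std::set, whose comparison is order-insensitive by construction - and inequality is simply the negation of equality, so fixing '=' once also fixes '<>'.

```cpp
#include <cassert>
#include <set>

// Order-insensitive set equality: std::set keeps elements sorted, so
// operator== compares contents regardless of insertion order.
bool sets_equal(const std::set<int>& a, const std::set<int>& b) {
    return a == b;
}

// The inequality operator is trivially the negation of the equality
// test, mirroring the fix described above.
bool sets_not_equal(const std::set<int>& a, const std::set<int>& b) {
    return !sets_equal(a, b);
}
```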
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to traverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.
Refs #5021
Fixes #8514
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:
When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should be true when both x and y are unset (but was false
before this patch).
The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.
Fixes#8511
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
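The semantics described above can be sketched with optionals (illustrative helpers, not the actual Alternator code): an unset attribute - modelled as an empty optional - is never equal to anything, including another unset attribute, so "x = y" is false and "x <> y" is true when both are missing.

```cpp
#include <cassert>
#include <optional>

// '=' semantics: any unset operand makes the comparison false, even
// when both operands are unset.
bool attrs_equal(const std::optional<int>& x, const std::optional<int>& y) {
    if (!x || !y) {
        return false;  // unset is never equal to anything
    }
    return *x == *y;
}

// '<>' is the negation of '=', so it is true for two unset attributes.
bool attrs_not_equal(const std::optional<int>& x, const std::optional<int>& y) {
    return !attrs_equal(x, y);
}
```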
Currently said method uses the system semaphore as a catch-all for all
scheduling groups it doesn't know about. This is incompatible with the
recent forward-porting of the service-level infrastructure as it means
that all service level related scheduling groups will fall back to the
system scheduling group, which causes two problems:
* They will experience much more limited concurrency, as the system
semaphore is assigned far fewer count units, to match the much more limited
internal traffic.
* They compete with internal reads, severely impacting the respective
internal processes, potentially causing extreme slowdown, or even
deadlock in the case of an internal query executed on behalf of a
user query being blocked on the latter.
Even if we don't have any custom service level scheduling groups at the
moment, it is better to change this such that unknown scheduling groups
fall back to using the user semaphore. We don't expect any new internal
scheduling group to pop up any time soon (and if they do we can adjust
get_reader_concurrency_semaphore() accordingly), but we do expect user
scheduling groups to be created in the future, even dynamically.
To minimize the chance of the wrong workload being associated with the
user semaphore, all statically created scheduling groups are now
explicitly listed in `get_reader_concurrency_semaphore()`, to make their
association with the respective semaphore explicit and documented.
Added a unit test which also checks the correct association for all
these scheduling groups.
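A minimal Python model of the new fallback behavior (the group names here are illustrative, not Scylla's actual list):

```python
# Statically created internal scheduling groups, explicitly listed so
# their association with the system semaphore is documented.
INTERNAL_GROUPS = {"main", "gossip", "streaming", "compaction", "memtable"}

def get_reader_concurrency_semaphore(scheduling_group):
    # Known internal groups keep using the system semaphore; anything
    # else -- including dynamically created service-level groups --
    # now falls back to the user semaphore instead of the system one.
    if scheduling_group in INTERNAL_GROUPS:
        return "system"
    return "user"
```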
Fixes: #8508
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210420105156.94002-1-bdenes@scylladb.com>
Back when rjson was only part of alternator, there was a hardcoded
limit of nested levels - 78. The number was calculated as:
- taking the DynamoDB limit (32)
- adding 7 to it to make alternator support more cases
- doubling it because rjson internals bump the level twice
for each alternator object (because the alternator object
is represented as a 2-level JSON object).
Since rjson is no longer specific to alternator, this limit
is now configurable, and the original default value is explained
in a comment.
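The arithmetic behind the original default, written out explicitly:

```python
DYNAMODB_MAX_NESTED_LEVELS = 32   # DynamoDB's documented nesting limit
ALTERNATOR_SLACK = 7              # extra levels so alternator supports more cases
RJSON_LEVELS_PER_OBJECT = 2       # rjson bumps the level twice per alternator object

DEFAULT_MAX_NESTED_LEVELS = (
    (DYNAMODB_MAX_NESTED_LEVELS + ALTERNATOR_SLACK) * RJSON_LEVELS_PER_OBJECT
)
```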
Message-Id: <51952951a7cd17f2f06ab36211f74086e1b60d2d.1618916299.git.sarna@scylladb.com>
The Redis server started as a copy of the CQL server, but did not
receive all the fixes of the CQL server over time. For example, commit
1a8630e ("transport: silence "broken pipe" and "connection reset by
peer" errors") was only done on the CQL server.
To remedy the situation, this pull request unifies code between the CQL
and Redis servers by introducing a "generic_server" component, and
switching CQL and Redis to use it.
Test: dtest(dev)
Closes #8388
* github.com:scylladb/scylla:
generic_server: Rename "maybe_idle" to "maybe_stop"
generic_server: API documentation for connection and server classes
transport, redis: Use generic server::listen()
transport/server: Remove "redis_server" prefix from logging
transport/server: Remove "cql_server" prefix from logging
generic_server: Remove unneeded static_pointer_cast<>
transport, redis: Use generic server::do_accepts()
transport, redis: Use generic server::process()
redis: Move Redis specific code to handle_error()
transport: Move CQL specific error handling to handle_error()
transport, redis: Move connection tracking to generic_server::server class
transport, redis: Move _stopped and _connections_list to generic_server::server class
transport, redis: Move total_connections to generic_server::server class
transport, redis: Use generic server::maybe_idle()
transport, redis: Move list_base_hook<> inheritance to generic_server::connection
transport, redis: Use generic connection::shutdown()
The set recycles 16 bytes from the reader class, makes use of
rows collection sugar, generalizes range tombstones emission and
adds an invariant-check.
tests: unit(dev)
* xemul/br-cache-reader-cleanups-1.2:
cache_flat_mutation_reader: Generalize range tombstones emission
cache_flat_mutation_reader: Tune forward progress check
cache_flat_mutation_reader: Use rows insertion sugar
cache_flat_mutation_reader: Move state field
cache_flat_mutation_reader: Remove raiish comparator
cache_flat_mutation_reader: Remove unused captured variable
cache_flat_mutation_reader: Fix trace message text
Support for random-seed was added in 4ad06c7eeb
but the program still uses std::rand() to draw random keys.
Use tests::random::get_int instead so we can get a reproducible
sequence of keys given a particular random-seed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210418104455.82086-1-bhalevy@scylladb.com>
It's notoriously hard to find the service level controller symbol
(possible by guessing the offset based on system_distributed_keyspace
address, but it's very cumbersome). To make the debugging process
easier, the symbol is exported via the `debug` namespace.
Closes #8506
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).
Closes #8457
repair: Handle everywhere_topology in bootstrap_with_repair
The everywhere_topology returns the number of nodes in the cluster as
the RF. This makes streaming only from the node losing the range
impossible, since no node loses the range after bootstrap.
Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.
Fixes #8503
Closes #8505
* github.com:scylladb/scylla:
repair: Make the log more accurate in bootstrap_with_repair
repair: Handle everywhere_topology in bootstrap_with_repair
We have logs
expected 1 node losing range but found *more* nodes
However, we can find zero nodes as well. Drop the word *more* from the log.
In addition, print the number of nodes found.
Refs #8503
Mention executables (scylla, tools and tests) as well as how to build
individual object files and how to verify individual headers. Also
mention the not-at-all obvious trick of how to build tests with debug
symbols.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210416131950.175413-1-bdenes@scylladb.com>
The everywhere_topology returns the number of nodes in the cluster as
the RF. This makes streaming only from the node losing the range
impossible, since no node loses the range after bootstrap.
Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.
Fixes #8503
By default, argparse will provide the value of the option as str.
Later, we compare it with an int, which will always be False. Fix by
telling argparse to provide it as an int.
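The class of bug being fixed, in a minimal form (the option name is hypothetical, not the actual one from the patch):

```python
import argparse

parser = argparse.ArgumentParser()
# Without type=int, argparse stores the value as the string "4", so a
# later comparison such as `args.smp == 4` is always False.
parser.add_argument("--smp", type=int, default=1)

args = parser.parse_args(["--smp", "4"])
```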
Message-Id: <20210415182149.686355-1-tgrabiec@scylladb.com>
Looks like in Python 3, division automatically yields a double/float,
even if both operands are integers. This results in
get_base_class_offset() returning a double/float, which breaks pointer
arithmetic (which is what the returned value is used for), because now
instead of decrementing/incrementing the pointer, the pointer will be
converted to a double itself silently, then back to some corrupt pointer
value. One user visible effect is `intrusive_list` being broken, as it
uses the above method to calculate the member type pointer from the node
pointers.
Fix by coercing the returned value to int.
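The failure mode and the fix, in a minimal sketch (the numbers are illustrative):

```python
def member_pointer(node_addr, offset):
    # Model of pointer arithmetic: the operands must stay integral.
    return node_addr - offset

byte_offset = 48
count = 3
# Broken: plain division yields a float even for integer operands...
assert isinstance(byte_offset / count, float)
# ...fixed: coerce the returned value to int before pointer arithmetic.
offset = int(byte_offset / count)
```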
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210415080034.167762-1-bdenes@scylladb.com>
database.hh is the central knot of recursive headers -- it has ~50
includes. This patch leaves only 34 (it remains the champion though).
Similar thing for database.cc.
Both changes help the latter compile ~4% faster :)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
The range tombstone can be added-to-buffer from two places:
when it was found in cache and when it was read from the
underlying reader. Both adders can now be generalized.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When adding a range tombstone to the buffer, the check for whether
to stop stuffing the already full buffer is only performed if
this particular range tombstone changes the lower_bound.
This check can be tuned -- if the lower bound changed
_at_ _all_ after a range tombstone was added, we may
still abort the loop.
This change will allow to generalize range tombstone
emission by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When inserting a rows_entry via unique_ptr, the pointer in question
can be pushed as is; the intrusive btree code releases the
pointer (to be exception safe) itself. This makes the code
a bit shorter and simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two alignment gaps in the middle of the
c_f_m_r -- one after the state and another one after
the set of bools. Keeping them together allows the
compiler to pack the c_f_m_r better.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The instance of position_in_partition::tri_compare sits
on the reader itself and just occupies memory. It can be
created on demand, all the more so since only one place
needs it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
First commit:
In the first commit, add validation of `enable` and `postimage` CDC options. Both options are boolean options, but previously they were not validated, meaning you could issue a query:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};
```
and it would be executed without any errors, silently interpreting `dsfdsd` as false.
The first commit narrows possible values of those boolean CDC options to `false`, `true`, `0`, `1`. After applying this change, issuing the query above would result in this error message:
```
ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
```
I actually encountered this lacking validation myself, as I mistakenly issued a query:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': 'full'};
```
incorrectly assigning `full` to `postimage`, instead of `preimage`. However, before this commit, this query executed without any errors, interpreting `full` as `false` and disabling postimages altogether.
Second commit:
The second commit improves the error message of invalid `ttl` CDC option:
Before:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi
```
After:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd
```
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
```
Closes #8486
* github.com:scylladb/scylla:
cdc: improve exception message of invalid "ttl"
cdc: add validation of "enable" and "postimage"
On Debian variants, sh -x ./install.sh will fail since our script is
written in bash, and /bin/sh on Debian variants is dash, not bash.
So detect a non-bash shell and print an error message, telling users
to run it in bash.
Fixes #8479
Closes #8484
We start background reclaim after we bootstrap, so bootstrap doesn't
benefit from it, and sees long stalls.
Fix by moving background reclaim initialization early, before
storage_service::join_cluster().
(storage_service::join_cluster() is quite odd in that main waits
for it synchronously, compared to everything else which is just
a background service that is only initialized in main).
Fixes #8473.
Closes #8474
Given that resharding is now a synchronous, mandatory step performed
before the table is populated, snapshot() can now get rid of code which
takes into account whether or not an sstable is shared.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210414121549.85858-1-raphaelsc@scylladb.com>
* seastar d2dcda96bb...0b2c25d133 (4):
> reactor: reactor_backend_epoll: stop using signals for high resolution timers
> reactor: move task_quota_timer_thread_fn from reactor to reactor_backend_epoll
> Merge "Report maximum IO lenghts via file API" from Pavel E
> Merge "Improve efficiency of io-tester" from Pavel E
Improve the exception message of providing invalid "ttl" value to the
table.
Previously, if you executed a CREATE TABLE query with invalid "ttl"
value, you would get a non-descriptive error message:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi
This commit adds more descriptive exception messages:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
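The validation being described could look like this sketch (the function name and the exact upper bound are illustrative assumptions, not Scylla's code):

```python
def parse_cdc_ttl(value: str) -> int:
    # Reject non-numeric input with a descriptive message instead of
    # letting a bare stoi error escape to the client.
    if not value.isdigit():
        raise ValueError(f'Invalid value for CDC option "ttl": {value}')
    ttl = int(value)
    if ttl > 2**31 - 1:  # assumed bound, for illustration only
        raise ValueError("Invalid CDC option: ttl too large")
    return ttl
```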
Add validation of "enable" and "postimage" CDC options. Both options
are boolean options, but previously they were not validated, meaning
you could issue a query:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};
and it would be executed without any errors, silently interpreting
"dsfdsd" as false.
This commit narrows possible values of those boolean CDC options to
false, true, 0, 1. After applying this change, issuing the query above
would result in this error message:
ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
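The narrowing of accepted boolean values can be modeled like this (a sketch with hypothetical names, not the actual C++ implementation):

```python
_BOOL_VALUES = {"true": True, "1": True, "false": False, "0": False}

def parse_cdc_bool(option: str, value: str) -> bool:
    # Only false/true/0/1 are accepted; anything else is rejected
    # instead of being silently interpreted as false.
    try:
        return _BOOL_VALUES[value]
    except KeyError:
        raise ValueError(f'Invalid value for CDC option "{option}": {value}')
```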
This patch fixes cql-pytest/run-cassandra to work on systems which
default to Java 11, including Fedora 33.
Recent versions of Cassandra can run on Java 11 fine, but require a
bunch of weird JVM options to work around its JPMS (Java Platform Module
System) feature. Cassandra's start scripts require these options to
be listed in conf/jvm11-server.options, which is read by the startup
script cassandra.in.sh.
Because our "run-cassandra" builds its own "conf" directory, we need
to create a jvm11-server.options file in that directory. This is ugly,
but unfortunately necessary if cql-pytest/run-cassandra is to run
on systems defaulting to Java 11.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210406220039.195796-1-nyh@scylladb.com>
We currently only update the failure detector for a node when a higher
version of application state is received. Since gossip syn messages do
not contain application state, we do not update the
failure detector upon receiving gossip syn messages, even if a message
from a peer node is received, which implies the peer node is alive.
This patch relaxes the failure detector update rule to update the
failure detector for the sender of gossip messages directly.
Refs #8296
Closes #8476
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.
In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.
We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
Fixes #8447.
Closes #8448
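The fix can be modeled as a generator that the reader queue returns first (a Python sketch of the idea, not the actual C++ reader):

```python
def dummy_reader(partition_key, forwarding):
    # Emits only the partition header, plus a partition-end when not in
    # forwarding mode, so a merge whose real readers were all filtered
    # out still produces a well-formed empty partition instead of an
    # immediate end-of-stream.
    yield ("partition_start", partition_key)
    if not forwarding:
        yield ("partition_end", partition_key)
```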
Support native building & unit testing in the Nix ecosystem under
nix-shell.
Actual dist packaging for Nixpkgs/NixOS is not there (yet?), because:
* Does not exactly seem like a huge priority.
* I don't even have a firm idea of how much work it would entail (it
certainly does not need the ld.so trickery, so there's that. But at
least some work would be needed, seeing how ScyllaDB needs to
integrate with its environment and NixOS is a little unorthodox).
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210413110508.5901-4-michael.livshin@scylladb.com>
"
These methods are generic ways to query a mutation source. At least they
used to be, but nowadays they are pretty specific to how tables are
queried -- they use a querier cache to lookup queriers from and save
them into. With the coming changes to how permits are obtained, they are
about to get even more specific to tables. Instead of forcing the
genericity and keep adding new parameters, this patchset bites the
bullet and moves them to table. `data_query()` is inlined into
`table::query()`, while `mutation_query()` is replaced with
`table::mutation_query()`.
The only other users besides table are tests and they are adjusted to
use similarly named local methods that just combine the right querier
with the right result builder. This combination is what the tests really
want to test, as this is also what is used by the table methods behind
the scenes.
Tests: unit(release, debug)
"
* 'mutation-query-move-query-methods-into-table/v1' of https://github.com/denesb/scylla:
mutation_query: remove now unused mutation_query()
test: mutation_query_test: use local mutation_query() implementation
database: mutation_query(): use table::mutation_query()
table: add mutation_query()
query: remove the now unused data_query()
test: mutation_query_test: use local data_query() implementation
table: query(): inline data_query() code into query()
table: make query() a coroutine
The cql_server and redis_server share the same ancestor of do_accepts().
Let's pull up the cql_server version of do_accept() (that has more
functionality) to generic_server::server and use it in the redis_server
too.
Pull up the cql_server process() to base class and convert redis_server
to use it.
Please note that this fixes EPIPE and connection reset issue in the
Redis server, which was fixed in the CQL server in commit 1a8630e6a
("transport: silence "broken pipe" and "connection reset by peer"
errors").
The cql_server and redis_server both have the same "_stopped" and
"_connections_list" member variables. Pull them up to the
generic_server::server base class.
The cql_server and redis_server classes have a maybe_idle() method,
which sets the _all_connections_stopped promise if server wants to stop
and can be stopped. Pull up the duplicated code to
generic_server::server class.
Both cql_server::connection and redis_server::connection inherit
boost::intrusive::list_base_hook<>, so let's pull up that to the
generic_server::connection class that both inherit.
This patch moves the duplicated connection::shutdown() method to a
new generic_server::connection base class that is now inherited by
cql_server and redis_server.
It is possible that a partition is in cache but is not present in sstables that are underneath.
In such case:
1. cache_flat_mutation_reader will fast forward underlying reader to that partition
2. The underlying reader will enter a state in which it's empty and its is_end_of_stream() returns true
3. Previously cache_flat_mutation_reader::do_fill_buffer would try to fast forward such empty underlying reader
4. This PR fixes that
Test: unit(dev)
Fixes #8435
Fixes #8411
Closes #8437
* github.com:scylladb/scylla:
row_cache: remove redundant check in make_reader
cache_flat_mutation_reader: fix do_fill_buffer
read_context: add _partition_exists
read_context: remove skip_first_fragment arg from create_underlying
read_context: skip first fragment in ensure_underlying
This check is always true because a dummy entry is added at the end of
each cache entry. If that wasn't true, the check in the else-if would
be UB.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Make sure that when a partition does not exist in underlying,
do_fill_buffer does not try to fast forward within this nonexistent
partition.
Test: unit(dev)
Fixes #8435
Fixes #8411
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This new state stores whether the current partition
represented by _key is present in underlying.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This was previously done in create_underlying but ensure_underlying is
a better place because we will add more related logic to this
consumption in the following patches.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
alternator/expressions.g had both AGPL and proprietary licensing. The
proprietary one is removed.
gms/inet_address_serializer.hh had only a proprietary license; it is
replaced by the AGPL.
Fixes #8465.
Closes #8466
This series introduces service level syntax borrowed from https://docs.scylladb.com/using-scylla/workload-prioritization/ , but without workload prioritization itself - just for the sake of using identical syntax to provide different parameters later. The new parameters may include:
* per-service-level timeouts
* oltp/olap declaration, which may change the way Scylla treats long requests - e.g. time them out (the oltp way) or keep them sustained with empty pages (the olap way)
Refs #7617
Closes #7867
* github.com:scylladb/scylla:
transport: initialize query state with service level controller
main: add initializing service level data accessor
service: make enable_shared_from_this inheritance public
cql3: add SERVICE LEVEL syntax (without an underscore)
unit test: Add unit test for per user sla syntax
cql: Add support for service level cql queries
auth: Add service_level resource for supporting in authorization of cql service_level
cql: Support accessing service_level_controller from query state
instantiate and initialize the service_level_controller
qos: Add a standard implementation for service level data accessor
qos: add waiting for the updater future
service/qos: adding service level controller
service_levels: Add documentation for distributed tables
service/qos: adding service level table to the distributed keyspace
service/qos: add common definitions
auth: add support for role attributes
In order for the syntax to be more natural, it's now possible
to use SERVICE LEVEL instead of SERVICE_LEVEL in all appropriate
places. The old syntax is supported as well.
This commit adds the infrastructure needed to test per user sla,
more specifically, a service level accessor that triggers the
update_service_levels_from_distributed_data function upon any
change to the distributed sla data.
A test was added that indirectly consumes this infrastructure by
changing the distributed service level data with cql queries.
Message-Id: <23b2211e409446c4f4e3e57b00f78d9ff75fc978.1609249294.git.sarna@scylladb.com>
This patch adds support for new service level cql queries.
The queries implemented are:
CREATE SERVICE_LEVEL [IF NOT EXISTS] <service_level_name>
ALTER SERVICE_LEVEL <service_level_name> WITH param = <something>
DROP SERVICE_LEVEL [IF EXISTS] <service_level_name>
ATTACH SERVICE_LEVEL <service_level_name> TO <role_name>
DETACH SERVICE_LEVEL FROM <role_name>
LIST SERVICE_LEVEL <service_level_name>
LIST ALL SERVICE_LEVELS
LIST ATTACHED SERVICE_LEVEL OF <role_name>
LIST ALL ATTACHED SERVICE_LEVELS
In order to be able to manage service_level configuration one must be authorized
to do so, or to be a superuser. This commit adds the support for service_levels
resource. Since service levels are relative, reconfiguring one service level is not localized
to that service level only and will affect the QoS for all of the service levels,
so there is not much sense in granting permissions to manage individual service_levels.
This is why only a root resource named service_levels, which represents all service levels, is used.
This commit also implements the unit test additions for the newly introduced resource.
Message-Id: <81ab16fa813b61be117155feea405da6266921e3.1609237687.git.sarna@scylladb.com>
In order to implement service level cql queries, the query objects
need access to the service_level_controller object when processing.
This patch adds this access by embedding it into the query state object.
In order to accomplish the above, the query processor object needs
access to the service_level_controller in order to instantiate the query state.
Message-Id: <68f5a7796068a49d9cd004f1cbf34bdf93b418bc.1609234193.git.sarna@scylladb.com>
service_level_controller defines an interface for accessing the service
level distributed data; this patch provides a standard implementation
of the interface that delegates to the system distributed keyspace.
Message-Id: <25e68302f6f4d4fe5fcb66ea19159ad68506ba64.1609175314.git.sarna@scylladb.com>
The distributed data updater used to spawn a future without waiting
for it. It was quite safe, since the future had its own abort source,
but it's better to remember it and wait for it during stop() anyway.
In the general case, roles might come with attributes attached to them.
These attributes can originate in mechanisms such as LDAP, where in
the underlying directory each entity can have a key:value data structure.
This patch adds support for such attributes in the role manager interface;
it also implements the attribute support in the standard role
manager in the form of a table with an attribute map in the distributed system keyspace.
Message-Id: <f53c74a7ac315c4460ff370ea6dbb1597821edc2.1609158013.git.sarna@scylladb.com>
fixes AddressSanitizer: stack-buffer-underflow on address 0x7ffd9a375820 at pc 0x555ac9721b4e bp 0x7ffd9a374e70 sp 0x7ffd9a374620
Backend registry holds a unique pointer to the backend implementation
that must outlive the whole tracing lifetime until the shutdown call.
So it must be captured/moved before the program exits its scope, by
passing it down the lambda chain.
Regarding deletion of the default destructor: moving an object requires
a move constructor (for do_with) that is not implicitly provided if
there is a user-defined destructor, even though its implementation
is defaulted.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Closes #8461
This pull request adds a "ninja help" build target in hopes of making
the different build targets more discoverable to developers.
Closes #8454
* github.com:scylladb/scylla:
building.md: Document "ninja help" target
configure.py: "ninja help" target
building.md: Document "ninja <mode>-dist" target
configure.py: Add <mode>-dist target as alias for dist-<mode>
Stop using seastar::pipe and use seastar::queue directly to pass log
entries to apply_fiber. The pipe is a layer above queue anyway, and it
adds functionality that we do not need (EOS) and hides functionality that
we do need (being able to abort()). This fixes a crash during abort where the
pipe was used after being destroyed.
Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>
* seastar 1c1f610ceb...d2dcda96bb (3):
> closeable: add with_closeable and with_stoppable helpers
> circleci: relax concurrency of the build process
> logger: failed_to_log: print source location and format string
The procedure is rewritten using std::partition, making it easier to
maintain. It also fixes a theoretical quadratic behavior where the
list was entirely copied when extending it. The old behavior wasn't
actually harmful, because the maintenance set will rarely be populated
and there are only 2 sets at most.
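The shape of the rewrite, modeled in Python (the predicate and set names are illustrative):

```python
def split_main_and_maintenance(sstables, is_maintenance):
    # Single linear pass, analogous to std::partition: no list is
    # copied while being extended, so the quadratic behavior is gone.
    main, maintenance = [], []
    for sst in sstables:
        (maintenance if is_maintenance(sst) else main).append(sst)
    return main, maintenance
```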
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210409171412.57729-1-raphaelsc@scylladb.com>
CDC log uses `bytes` to deal with cells and their values, and linearizes all
values indiscriminately. This series makes a switch from `bytes` to
`managed_bytes` to avoid that linearization.
Fixes #7506.
Closes #8429
* github.com:scylladb/scylla:
cdc: log: change yet another occurence of `bytes` to `managed_bytes`
cdc: log: switch the remaining usages of `bytes` to `managed_bytes` in collection_visitor
cdc: log: change `deleted_elements` in log_mutation_builder from bytes to managed_bytes
cdc: log: rewrite collection merge to use managed_bytes instead of bytes
cdc: log: don't linearize collections in get_preimage_col_value
cdc: log: change return type of get_preimage_col_value to managed_bytes
cdc: log: remove an unnecessary copy in process_row_visitor::live_atomic_cell
cdc: log: switch cell_map from bytes to managed_bytes
cdc: log: change the argument of log_mutation_builder::set_value to managed_bytes_view
cdc: log: don't linearize the primary key in log_mutation_builder
atomic_cell: add yet another variant of make_live for managed_bytes_view
compound: add explode_fragmented
Right now, binary_operator::lhs is a variant<column_value,
std::vector<column_value>, token>. The role of the second branch
(a vector of column values) is to represent a tuple of columns
e.g. "WHERE (a, b, c) = ?"), but this is not clear from the type
name.
Introduce a wrapper type around the vector, column_value_tuple, to
make it clear we're dealing with tuples of CQL references (a
column_value is really a column_ref, since it doesn't actually
contain any value).
Closes #8208
This adds a "help" build target, which prints out important build
targets. The printing is done in a separate shell script, because "ninja"
insists on printing out the "command" before executing it, which makes the
help text unreadable.
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).
The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).
It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.
Fixes #8445.
Closes #8446
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnStaticColumnTest.java into our
cql-pytest framework.
This test file checks various features of indexing (with secondary index)
static rows. All these tests pass on Cassandra, but fail on Scylla because
of issue #2963 - we do not yet support indexing of a static row.
The failing tests currently fail as soon as they try to create the index,
with the message:
"Indexing static columns is not implemented yet."
Refs #2963.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411153014.311090-1-nyh@scylladb.com>
This patch avoids an annoying warning
Warning: Unknown config ini key: flake8-ignore
when running one of the pytest-based test projects (cql-pytest,
alternator and redis) on recent versions of pytest.
In commit 2022da2405, we added to the
toplevel Scylla directory a "tox.ini" file with some intention to
configure Python syntax checking. One of the configurations in this
tox.ini is:
[pytest]
flake8-ignore =
E501
It turns out that pytest, if a certain test directory does not have its
own pytest.ini file, looks in ancestor directories for various
configuration files (the configuration file precedence is described in
https://docs.pytest.org/en/stable/customize.html), and this includes
this tox.ini configuration section. Recent versions of pytest complain
about the "flake8-ignore" configuration parameter, which they don't
recognize. This parameter may be ok (?) if you install a flake8 pytest
plugin, but we do not require users to do this for running these tests.
Moreover, whatever noble intentions this commit and its tox.ini had,
nobody ever followed up on it. The three pytest-based test directories
never adhered to flake8's recommended syntax, and never intended to do
so. None of the developers of these tests use flake8, or seem to wish
to do so. If this ever changes, we can change the pytest.ini or undo this
commit and go back to a top-level tox.ini, but I don't see this happening
anytime soon.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411085708.300851-1-nyh@scylladb.com>
install.sh supports two different ways of redirecting paths:
--root for creating a chroot-style tree, and --prefix for changing
the installed file location. Document them.
Closes #8389
std::is_fundamental isn't a good constraint since it includes nullptr_t
and void. Replace with std::integral, which is sufficient. Use a concept
instead of enable_if to simplify the code.
Closes#8450
Currently, scripts/refresh-submodules.sh always refreshes all
submodules, i.e., takes the latest version of all of them and
commits it. But sometimes, a committer only wants to refresh a specific
submodule, and doesn't want to deal with the implications of updating
a different one.
As a recent example, for issue #8230, I wanted to update the tools/java
submodule, which included a fix for sstableloader, without updating the
Seastar submodule - which contained completely irrelevant changes.
So in this patch we add the ability to override the default list of
submodules that refresh-submodules.sh uses, with one or more command
line parameters. For example:
scripts/refresh-submodules.sh tools/java
will update only tools/java.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411151421.309483-1-nyh@scylladb.com>
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.
This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).
The fix here is to select a slightly smaller max_size that is aligned
down to a multiple of 1MB.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
This function will be used on repair-based operation completion,
to notify the table about the need to start the off-strategy compaction
process on the maintenance sstables produced by the operation.
The function which notifies about bootstrap and replace completion
is changed to use this new function.
Removenode and decommission will reuse this function.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There are some users of that tri_comparator which are also
converted to strong_ordering. Most of the code using those
is, in turn, already handling return values interchangeably.
The bound_view::tri_compare, which is used by it, still
returns int.
tests: unit(dev)
* xemul/br-position-tri-compare:
code: Relax position_in_partition::tri_compare users
position_in_partition: Convert tri_compare to strong_ordering
test: Convert clustering_fragment_summary::tri_cmp to strong_ordering
repair: Convert repair_sync_boundary::tri_compare to strong_ordering
view: Don't expect int from position_in_partition::tri_compare
There are some pieces left doing res <=> 0 with the
res now being a strong_ordering itself. All these can
be just dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All its users are now ready to accept both - int and
the strong_ordering value, so the change is pretty
straightforward.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set puts the partition_snapshot_row_cursor on a
diet: 320 -> 224 bytes, makes use of btree API sugar
to save some CPU cycles and compacts the code.
tests: unit(dev)
* xemul/br-row-cursor-cleanups-2:
partition_snapshot_row_cursor: Rewrite row() with consume_row()
partition_snapshot_row-cursor: Add const consume_row() version
partition_snapshot_row_cursor: Add concept to .consume_row()
partition_snapshot_row_cursor: Don't carry end iterators
partition_snapshot_row_cursor: Move cells hash creation to reader
partition_snapshot_row_cursor: Move read_partition into test
partition_snapshot_row_cursor: Move is_in_latest_version inline
partition_snapshot_row_cursor: Use is_in_latest_version where appropriate
partition_snapshot_row_cursor: Less dereferences in key() method
partition_snapshot_row_cursor: Update change mark in prepare_heap
partition_snapshot_row_cursor: Clear current row when recreating
partition_snapshot_row_cursor: Use btree::lower_bound sugar
partition_snapshot_row_cursor: Factor out next() and erase_and_advance()
partition_snapshot_row_cursor: Relax vector of iterators
btree: Add operator bool()
clustering_row: Add new .apply() overload
If somebody wants to query a generic mutation source in the future, they
can still do it via `mutation_querier::consume_page()` and the right
result builder.
Add a local `mutation_query()` variant, which only contains the pieces
of logic the test really wants to test: invoking
`mutation_querier::consume_page()` with a `reconcilable_result_builder`.
This allows us to get rid of the now otherwise unused
`mutation_query()`.
Instead of `mutation_query()` from `mutation_query.hh`. The latter is
about to be retired as we want to migrate all users to
`table::mutation_query()`.
As part of this change, move away from `mutation_query_stage` too. This
brings the code paths of the two query variants closer together, as they
both have an execution stage declared in `database`.
We want to migrate `database::mutation_query()` off `mutation_query()`
to use `table::mutation_query()` instead. The reason is the same as for
making `table::query()` standalone: the `mutation_query()`
implementation increasingly became specific to how tables are queried
and is about to become even more specific due to impending changes to
how permits are obtained. As no-one in the codebase is doing generic
mutation queries on generic mutation sources we can just make this a
member of table.
This patch just adds `table::mutation_query()`, no user exists yet.
`table::mutation_query()` is identical to `mutation_query()`, except
that it is a coroutine.
If somebody wants to query a generic mutation source in the future, they
can still do it via `data_querier::consume_page()` and the right result
builder.
The test only wants to test result size calculation so it doesn't need
the whole `data_query()` logic. Replace the call to `data_query()` with
one to a local alternative which contains just the necessary bits --
invoking `data_querier::consume_page()` with the right result builder.
This allows us to get rid of the now otherwise unused `data_query()`.
`data_query()` is now just a thin wrapper over
`data_querier::consume_page()`. Furthermore, contrary to the old data
query method, it is not a generic way of querying a mutation source, it
is now closely tied to how we query tables. It does a querier lookup and
save. In the future we plan on tying it even closer to the table in how
permits are obtained. For this reason it is better to just inline it
into the `query()` method which invokes it.
It's the same as the existing one, but doesn't modify
anything (neither the cursor nor the rows_entry's it points at) and calls
the consumer with a const row reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The btree's iterator can be checked to reach the tree's end
without holding the ending iterator itself. This makes the
whole p_s_r_c 20% smaller (288 bytes -> 224 bytes) since it
no longer keeps 4 extra iterators on-board -- inside small vectors
for heap and current_row.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now call to .row() method may create hash on row's cells.
It's counterintuitive to see a const method that transparently
changes something it points to. Since the only caller of a row()
who knows whether the hash creation is required is the cache
reader, it's better to move the call to prepare_hash() into it.
Other than making the .row() less surprising this also helps to
get rid of the whole method by the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is a test-only helper; there's no
need to keep it as part of the API.
Another reason to move is that the method is O(number of
rows) and doesn't preempt while looping, but cursor code
users try hard not to stall the reactor. So even though
this method has meaningful semantics within the class,
it is better reinvented if needed in core code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is currently defined outside of the class, which
gives the compiler less chance to really inline it when needed.
Also, keeping this simple piece of code inline is less code
to read (and compile).
Mark it noexcept while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Checking for _current_row[0].version being 0 (or not being 0)
is better understood if done with a well-named existing helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The valid cursor's key is kept in _position as well,
and getting it from there is one dereference less:
_current_row -(*)-> row -> key
_position -(**)-> std::optional -> key
* iterator's -> is pointer dereference
** std::optional is designed not to be a pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The heap's iterators validity is checked with the change mark,
which is updated every time heap is recreated. Factor these
updates out and keep the mark together with the heap it protects.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cursor keeps current row in a separate vector of iterators
and reconstructs it in a dedicated method, which _expects_ that
the vector is empty on entry.
It's better to keep the logic of current row construction in one
place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When checking if the lower-bound entry matched the search
key it's possible to avoid extra comparison with the help
of the collection used to store the rows (btree).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both helpers do the same -- advance the cursor to the next row.
The latter may additionally remove the row from the uniquely
owned version.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cursor maintains a vector of iterators that correspond to
each of the versions scanned. However, only the iterator in
the latest one is really needed, so the whole vector can be
reduced down to an optional.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is very hard to read or modify in its current form due to
all the continuation-chain boilerplate. Make it a coroutine to
facilitate future changes, starting with the ones in the next patches.
The btree's iterators allow for simple checking for '== tree.end()'
condition. For this check neither the tree itself, nor the ending
iterator is required. One just needs to check if the _idx value is
the npos.
One additional change to make it work is required -- when removing
an entry from the inline node the _idx should be set to npos.
This change is, well, a bugfix. An iterator left with 0 in _idx is
treated as a valid one. However, the bug is non-triggerable. If such
an "invalid" iterator is compared against tree.end() the check would
return true, because the tree pointers would coincide.
So this patch adds an operator bool() to btree iterator to facilitate
simpler checking if it reached the end of the collection or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The clustering_row is a wrapper over the deletable_row and
facilitates the apply-creation of the latter from some other
objects. Soon it will accept the deletable_row itself for
apply()-ing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is dangerous to print a formatted string as is, like
sslog.warn(err.c_str()) since it might hold curly braces ('{}')
and those require respective runtime args.
Instead, it should be logged as e.g. sslog.warn("{}", err.c_str()).
This will prevent issues like #8436.
Refs #8436
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-2-bhalevy@scylladb.com>
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.
Fixes#8432.
Closes#8433
With the current implementation, we may re-run the NTP configuration
even if it is already configured.
Also, the system may be configured with a non-default NTP client; we were
just ignoring that and configuring the default one.
This patch minimizes unnecessary re-configuration of the NTP client.
It runs in the following order:
1. Check if an NTP client is already running. If it is, skip setup
2. Check if an NTP client is already installed. If it is, use it
3. If no NTP client package is installed,
- if it's CentOS, install chrony
- if it's on other distributions, install systemd-timesyncd
Closes#8431
* github.com:scylladb/scylla:
scylla_ntp_setup: detect already installed ntp client
scylla_util.py: return bool value on systemd_unit.is_active()
With the current implementation, we may re-run the NTP configuration
even if it is already configured.
Also, the system may be configured with a non-default NTP client; we were
just ignoring that and configuring the default one.
This patch minimizes unnecessary re-configuration of the NTP client.
It runs in the following order:
1. Check if an NTP client is already running. If it is, skip setup
2. Check if an NTP client is already installed. If it is, use it
3. If no NTP client package is installed,
- if it's CentOS, install chrony
- if it's on other distributions, install systemd-timesyncd
Related with #8344, #8339
The series contains code cleanups and fixes for stepdown process
and quorum check code.
Note this is re-send of already posted patches lumped together for
convenience.
* scylla-dev/raft-fixes-v1:
raft: add test for check quorum on a leader
raft: fix quorum check code for joint config and non-voting members
raft: do not hang on waiting for entries on a leader that was removed from a cluster
raft: add more tracing to stepdown code
raft: use existing election_elapsed() function instead of redoing the calculation
raft: test: add test case for stepdown process
raft: check that a node is still the leader after initiating stepdown process
Currently, 'if unit.is_active():' is always True since is_active()
returns its result as a string (active, inactive, unknown).
To avoid such scripting bugs, change the return value to bool.
1) start n1, n2, n3
2) decommission n3
3) remove /var/lib/scylla for n3
4) start n4 with the same ip address as n3 to replace n3
5) replace will be successful
If a node has left the ring, we should reject the replace operation.
This patch makes the check during replace operation more strict and
rejects the replace if the node has left the ring.
After the patch, we will see
ERROR 2021-04-07 08:02:14,099 [shard 0] init - Startup failed:
std::runtime_error (Cannot replace_adddress 127.0.0.3 because it has left
the ring, status=LEFT)
Fixes#8419
Closes#8420
Those functions were defined in a header, but not marked inline.
This made including the header from two source files impossible,
as the linker would complain about duplicate symbols.
Rather than making them inline, put them in a new source file
perf.cc as they don't need to be inline.
Fixes#8363
Fixes#8376
Deleting segments has two issues when running with a size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
If we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segments.
Abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.
Closes#8372
* github.com:scylladb/scylla:
commitlog: Add signalling to recycle queue iff we fail to recycle
commitlog: Fix race and edge condition in delete_segments
commitlog: coroutinize delete_segments
commitlog_test: Add test for deadlock in recycle waiter
The "most important" major changes are:
1. storage_service: simplify CDC generation management during node replace
Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.
But that's not necessary:
- if this is the timestamp of the last (current) generation, we would
obtain it from other nodes anyway (every node gossips the last known
timestamp),
- if this is the timestamp of an earlier generation, we would forget
it immediately and start gossiping the last timestamp (obtained from
other nodes).
This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.
2. tree-wide: introduce cdc::generation_id type
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
- if a piece of code doesn't care about the timestamp, it just passes
the ID around
- if it does care, it can access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
3. cdc: handle missing generation case in check_and_repair_cdc_streams
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.
I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
---
Additionally the PR does some simplifications, stylistic improvements,
removes some dead code, coroutinizes some functions, uncoroutinizes others
(due to miscompiles), adds additional logging, updates some stale comments.
Read commit messages for more details.
Closes#8283
* github.com:scylladb/scylla:
cdc: log a message when creating a new CDC generation
cdc: handle missing generation case in check_and_repair_cdc_streams
tree-wide: introduce cdc::generation_id type
tree-wide: rename "cdc streams timestamp" to "cdc generation id"
cdc: remove some functions from generation.hh
storage_service: make set_gossip_tokens a static free-function
db: system_keyspace: group cdc functions in single place
cdc: get rid of "get_local_streams_timestamp"
sys_dist_ks: update comment at quorum_if_many
storage_service: simplify CDC generation management during node replace
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.
I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
The compound set isn't overriding create_single_key_sstable_reader(), so
the default implementation is always called. Although the default impl
provides correct behavior, specialized ones which provide better perf
(currently only available for TWCS) were being ignored.
The compound set impl of the single key reader will basically combine the
single key readers of all sets managed by it.
Fixes#8415.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>
Currently the heuristics for picking an index for a query
are not very well defined. It would be best if we used
statistics to pick the index which is likely to perform
the fastest, but for starters we should at least let the user
decide which index to pick by picking the first one by the
order of restrictions passed to the query.
The (failing) test case from this patch shows the expected
results.
Ref: #7969
Closes#8414
* github.com:scylladb/scylla:
cql-pytest: add a failing test for index picking order
cql3: add tracing used secondary index
Currently the heuristics for picking an index for a query
are not very well defined. It would be best if we used
statistics to pick the index which is likely to perform
the fastest, but for starters we should at least let the user
decide which index to pick by picking the first one by the
order of restrictions passed to the query.
The (failing) test case from this patch shows the expected
results.
Ref: #7969
The current leader code checks for most nodes to be alive, but this is
incorrect since some nodes may be non-voting and hence should not
cause a leader to step down if dead. It is also incorrect with a joint
config, since quorum is calculated differently there. Fix it by introducing
an activity_tracker class that knows how to handle all the above details.
If a leader is removed from a cluster it will never know when entries
that it has not yet committed will be committed, so abort the wait in
this case with an uncertainty error.
Add a test for the case where the C_new entry is not the last one in a
leader that is being removed from a cluster. In this case the leader will
continue replication even after committing C_new and will start the
stepdown process later, when at least one follower is fully synchronized.
Usually initiating the stepdown process does not immediately depose the
current leader, but if the current leader is no longer part of the
cluster it does. We were missing this check after initiating the
stepdown process in append-reply handling.
We inherited a very low threshold for warning about and failing
multi-partition batches, but these warnings aren't useful. The size of a
batch in bytes has no impact on node stability. In fact the warnings can
cause more problems if they flood the log.
Fix by raising the warning threshold to 128 kiB (our magic size)
and the fail threshold to 1 MiB.
Fixes#8416.
Closes#8417
Fixes#8376
If a recycle should fail, we will sort of handle it by deleting
the segment, so no leaks. But if we have waiter(s) on the recycle
queue, we could end up deadlocked/starved because nothing is
incoming there.
This adds an abort of the queue iff we failed and no objects are
available. This will wake up any waiter, and he should retry,
and hopefully at least be able to create a new segment.
We then reset the queue to a new one. So we can go on.
v2:
* Forgot to reset queue
v3:
* Nicer exception handling in allocate_segment_ex
Fixes#8363
Deleting segments has two issues when running with a size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
Fixed by
a.) Doing deletes serialized. It is not like being parallel here will
win us speed awards. And now we can know exact footprint, and
how many segments we have left to delete
b.) Check if we are a block across the footprint boundary, and people
might be waiting for a segment. If so, don't delete segment, but
recycle.
As a follow-up, we should probably instead adjust the commitlog size
limit (per shard) to be a multiple of segment sizes, but there is
risks in that too.
The indexed queries will now record which index was chosen
for fetching the base table keys.
Example output:
activity
------------------------------------------------------------------------------------------------------------------------
Parsing a statement
Processing a statement
Consulting index my_v2_idx for a single slice of keys
Creating read executor for token -3248873570005575792 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE
read_data: querying locally
Start querying singular range {{-3248873570005575792, pk{000400000002}}}
Querying cache for range {{-3248873570005575792, pk{000400000002}}} and slice {(-inf, +inf)}
Querying is done
Done processing - preparing a result
* scylla-dev/raft-cleanup-v2:
raft: test: add test that leader behaves as expected when it gets unexpected messages
raft: do not assert when receiving unexpected messages in a leader state
raft: use existing function to check if election timeout elapsed
A follow-up to the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.
The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.
For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows us to
increase code reuse with template metaprogramming and to remove a few
UUID_gen functions that became redundant as a result.
* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(),
min_time_UUID(), max_time_UUID(), create_time_safe() to
std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse two similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
By far the two slowest Alternator tests when running a development build on
my laptop are
test_gsi.py::test_gsi_projection_include
and
test_gsi.py::test_gsi_projection_keys_only
Each of those takes around 3.2 seconds, and the sum of just these two tests is as
much as 10% (!) of all other 600 tests.
The reason why these tests are slow is that they check scanning a GSI
with *projection*. Scylla currently ignores the projection, so the scan
returns the wrong value. Because this is a GSI, which supports only
eventually- consistent reads, we need to retry the read - and did it for
up to 3 seconds!
But this retry only makes sense if the GSI read did not *yet* return
the expected data. But in these xfailing tests, we read a *wrong* item
(with too many attributes) almost immediately, and this should indicate
an immediate failure - no amount of retry would help. So in this patch
we detect this case and fail the test immediately instead of wasting
3 seconds in retries.
On my laptop with dev build, this patch reduces the time to run the
entire Alternator test suite from 70 seconds to 63 seconds.
Also, now that we never just waste time until the timeout, we can
increase it to any number, and in this patch we increase it from 3
seconds to 5.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317183918.1775383-1-nyh@scylladb.com>
In conftest.py we have several fixtures creating shared tables which many
test files can share, so they are marked with the "session" scope - all
the tests in the testing session may share the same instance. This is fine.
Some of test files have additional fixtures for creating special tables
needed only in those files. Those were also, unnecessarily, marked
"session" scope as well. This means that these temporary tables are
only deleted at the very end of the test suite, even though they can be
deleted at the end of the test file which needed them. This is exactly
what the "module" fixture scope is, so this patch changes all the
fixtures private to one test file to be "module".
After this patch, the teardown of the last test in the suite goes down
from 4 seconds to just 1.5 seconds (it's still long because there are
still plenty of session-scoped fixtures in conftest.py).
Another small benefit is that the peak disk usage of the test suite is
lower, because some of the temporary tables are deleted sooner.
This patch does not change any test functionality, and also does not
make any test faster - it just changes the order of the fixture
teardowns.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317175036.1773774-1-nyh@scylladb.com>
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).
It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.
The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.
Some stale comments have been updated.
This function retrieves the persisted timestamp of the last known CDC
generation (which this node is currently gossiping to other nodes).
It checks that the timestamp is present; if not, it throws an error.
The check is unnecessary. It's used only in a quite esoteric place
(start_gossiping, which implements an almost-never-used API call),
and it's fine if the timestamp is gone - in start_gossiping,
we can start gossiping the tokens without the CDC generation timestamp
(well, if the timestamp is not present in system tables, something
weird must have happened, but that doesn't mean we can't resume
gossiping - fixing CDC generation management in such a case is
a separate problem).
The comment mentioned tables that no longer exist: their names have
changed some time ago. Update the comment to be name-agnostic.
Furthermore, the second part of the comment related to a case of "joining
a node without bootstrapping". Fortunately this operation is no longer
possible (after #6848 which became part of Scylla 4.3) so we can shorten
the comment.
Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.
But that's not necessary:
1. if this is the timestamp of the last (current) generation, we would
obtain it from other nodes anyway (every node gossips the last known
timestamp),
2. if this is the timestamp of an earlier generation, we would forget
it immediately and start gossiping the last timestamp (obtained from
other nodes).
This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.
We have a scripting bug: when /var/log/journal exists, install.sh does not generate scylla_sysconfdir.py.
Stop generating scylla_sysconfdir.py inside the if/else condition; do that
unconditionally in install.sh, and also drop the pre-generated
scylla_sysconfdir.py from dist/common/scripts.
Also, $rsysconfdir is the correct path for the nonroot-mode sysconfdir,
instead of $sysconfdir.
Fixes #8385
Closes #8386
- it does not support using interface names
- listen_interface is not supported
- 0.0.0.0 will work (and is reasonable) if you set broadcast_address
- empty setting is not supported
Fixes#8381.
Closes#8409
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC, the current code will crash since the
'targets' array will be empty and the read executor does not handle it.
Fix it by replying with empty result.
Fixes#8354
Message-Id: <YGro+l2En3fF80CO@scylladb.com>
caching_options is by no means performance sensitive, but it is
included in many places (via schema.hh), and in turn it pulls in
other includes. Reduce include load by de-inlining it.
Ref #1.
Closes#8408
I wanted to move caching_option.hh's content to .cc (so that it doesn't
pull in rjson.hh everywhere), but for that, we must first make from_map()
a non-template function.
Luckily, it is only called with one parameter type, so just substitute
that type for the template parameter.
Closes#8406
has_empty is a textbook example of a concept: it checks whether a type
has an empty() method that returns bool. It was implemented with
enable_if; simplify it to a concept.
I verified that the debug build doesn't contain any incorrect
emptyable<T> (e.g. for strings).
Closes#8404
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such a case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.
Fixes#8398.
Closes#8397
This just causes unneeded and slower recompilations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.
Ref #1.
Closes#8403
These functions were not used anywhere but had to be maintained anyway.
When (if) the expiration algorithm actually gets implemented (see issue #7300),
the functions can be added back (perhaps they will need to look differently
at that time, and it's likely that the `expire` column won't be used in the
expiration algorithm in the end anyway).
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as follows:
1) the replacing node does not respond to echo messages, so other nodes
do not mark it as alive
2) the replacing node advertises the hibernate state so other nodes know
it is replacing
3) the replacing node responds to echo messages so other nodes can mark
it as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem, we can do the replacing operation in multiple stages.
One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416
1) the replacing node does not respond to echo messages
2) the replacing node advertises the prepare_replace state (remove the
replacing node from natural endpoints, but do not put it in the pending list yet)
3) the replacing node responds to echo messages
4) the replacing node advertises the hibernate state (put the replacing node
in the pending list)
Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can do the multiple stages without introducing a new
gossip status state.
This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.
Improvements:
1) It solves the race between marking replacing node alive and sending
writes to replacing node
2) The cluster reverts to a state before the replace operation
automatically in case of error. As a result, it solves the issue where a
replacing node that fails in the middle of the operation would stay in
HIBERNATE status forever.
3) The gossip status of the node to be replaced is not changed until the
replace operation is successful. HIBERNATE gossip status is not used
anymore.
4) Users can now pass a list of dead nodes to ignore explicitly.
Fixes #8013
Closes #8330
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for replace operation
gossip: Add advertise_to_nodes
gossip: Add helper to wait for a node to be up
gossip: Add is_normal_ring_member helper
The current code asserts when it gets InstallSnapshot/AppendRequest in the
leader state and the term in the message is equal to the current term. It is
true that such messages cannot be received if the protocol works correctly,
but we should not crash on network input nonetheless.
In `storage_service::removenode`, in "Step 5", services which implement
`endpoint_lifecycle_subscriber` are first notified about the node
leaving the cluster, and only after that the gossiper state is updated
(comments added by me):
    // This function indirectly notifies subscribers
    ss.excise(std::move(tmp), endpoint);
    // This function updates the gossiper state
    ss._gossiper.advertise_token_removed(endpoint, host_id).get();
This order is confusing for those subscribers which expect the fact
that the node is leaving to be reflected in the gossiper state - more
specifically, for hints manager.
The hints manager has a function `can_send()` which determines if it is
OK for it to try send hints. More specifically, it looks at the
gossiper state to see if the destination node is ALIVE or if it has
left the ring. The first case is obvious as the destination node will
be able to receive the hints as writes, while the other means that the
hints will be sent with CL=ALL to its new replicas.
When a node leaves the cluster, all hint queues either to or from that
node enter the "drain" mode - the queue will attempt to send out all
hints and will drop those hints which failed to be sent. This mode is
triggered by a notification from the storage_service (hints manager is
a lifecycle subscriber).
The core drain logic for a queue looks as follows:
    manager_logger.trace("Draining for {}: start", end_point_key());
    set_draining();
    send_hints_maybe();
    _ep_manager.flush_current_hints().handle_exception([] (auto e) {
        manager_logger.error("Failed to flush pending hints: {}. Ignoring...", e);
    }).get();
    send_hints_maybe();
    manager_logger.trace("Draining for {}: end", end_point_key());
And `send_hints_maybe` contains the following loop:
    while (replay_allowed() && have_segments() && can_send()) {
        if (!send_one_file(*_segments_to_replay.begin())) {
            break;
        }
        _segments_to_replay.pop_front();
        ++replayed_segments_count;
    }
Coming back to the `storage_service::removenode` - because of the order
of `excise` and `advertise_token_removed`, draining starts before the
node which is being removed is removed from gossiper state. In turn,
it might happen that the drain logic calls `send_hints_maybe` twice and
does not send any hints - the loop in that function will immediately
stop because `can_send()` is false because the gossiper state still
reports that the target node is not alive. The logic expects `can_send`
to be true here because the node has left the ring.
This patch changes the order of `excise` and `advertise_token_removed`
in `storage_service::removenode` - now, the first one is called after
the other. This ensures that the gossiper state is updated before
listeners are called, and the race described in the commit message does
not happen anymore - `can_send` is true when the node is being drained.
The race described here was exposed by the following commit:
77a0f1a153
Fixes: #5087
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes#8284
When we want to parse a linearized buffer of bytes, we're copying them
into the first and only element of the _read_bytes vector. Thus
_read_bytes often contains only one element, which makes a small_vector
a better alternative.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.
Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes#8387
* github.com:scylladb/scylla:
hints: clarify docstring comment for can_send
hints: use token_metadata to tell if node is in the ring
hints: slightly reorganize "if" statement in can_send
storage_service: release token_metadata lock before notify_left
storage_service: notify_left after token_metadata is replicated
The entire sstable cell value is currently stored in a single
temporary_buffer. Cells may be very large, so to avoid large
contiguous allocations, the buffer is changed to
a fragmented_temporary_buffer.
Fixes #7457
Fixes #6376
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately a straightforward
rewrite to a fragmented buffer type is not possible, because we want
cql3::raw_value to be compatible with cql3::raw_value_view, and we want that
view to be based on fragmented_temporary_buffer::view, so that it can be
used to view data coming directly from seastar without copying.
This patch makes cql3::raw_value fragmented by making cql3::raw_value_view
a `variant` of managed_bytes_view and fragmented_temporary_buffer::view.
Code which depended on `cql3::raw_value` being `bytes`,
and `cql3::raw_value_view` being `fragmented_temporary_buffer::view` underneath
were adjusted to the new, dual representation, mainly through the
`cql3::raw_value_view::with_value` visitor and deserialization/validation
helpers added to `cql3::raw_value_view`.
The second part of this series gets rid of linearizations occurring when processing
compound types in the CQL layer. This is achieved by storing their elements in
`managed_bytes` instead of `bytes` in the partially deserialized form (`lists::value`,
`tuples::value`, etc.), outputting `managed_bytes` instead of `bytes` in functions
which go from the partially deserialized form to the atomic cell format (for frozen
types), and avoiding calling deserialize/serialize on individual elements when
it's not necessary. (It's only necessary for CQLv2, because since CQLv3 the format
on the wire is the same as our internal one).
The above also forces some changes to `expression.cc`, and `restrictions`, mainly because
`IN` clauses store their arguments as `lists` and `tuples`, and the code which handled
this clause expected `bytes`.
After this series, the path from prepared CQL statements to `atomic_cell_or_collection`
is almost completely linearization-free. The last remaining place is `collection_mutation_description`,
where map keys are linearized to `bytes`.
Closes#8160
* github.com:scylladb/scylla:
cql3: update_parameters: remove unused version of make_cell for bytes_view
types: collection: remove an unused version of pack_fragmented
cql3: optimize the deserialization of collections
cql3: maps, sets: switch the element type from bytes to managed_bytes
cql3: expression: use managed_bytes instead of bytes where possible
cql3: expr: expression: make the argument of to_range a forwarding reference
cql3: don't linearize elements of lists, tuples, and user types
cql3: values: add const managed_bytes& constructor to raw_value_view
cql3: output managed_bytes instead of bytes in get_with_protocol_version
types: collection: add versions of pack for fragmented buffers
types: add write_collection_{value,size} for managed_bytes_mutable_view
cql3: tuples, user_types: avoid linearization in from_serialized() and get()
types: tuple: add build_value_fragmented
cql3: update_parameters: add make_cell version for managed_bytes_view
cql3: remove operation::make_*cell
cql3: values: make raw_value fragmented
cql3: values: remove raw_value_view::operator==
cql3: switch users of cql3::raw_value_view to internals-independent API
cql3: values: add an internals-independent API to raw_value_view
utils: managed_bytes: add a managed_bytes constructor from FragmentedView
utils: managed_bytes: add operator<< and to_hex for managed_bytes
utils: fragment_range: add to_hex
configure: remove unused link dependencies from UUID_test
"
This small patchset restructures the do_wait_admission() to facilitate
future changes to the wait/admission conditions. The changes we want to
facilitate are Benny's flat mutation reader close series and my stalled
readers series. As an added benefit the code is more readable and a
small theoretical corner-case bug is fixed. No logical changes (besides
the small bug-fix).
Tests: unit(dev)
"
* 'reader-concurrency-semaphore-refactor/v1' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: remove now unused may_proceed()
reader_concurrency_semaphore: restructure do_wait_admission()
reader_concurrency_semaphore: extract enqueueing logic into enqueue_waiter()
reader_concurrency_semaphore: make admission conditions consistent
It turned out that all the users of btree can already be converted
to use safer std::strong_ordering. The only meaningful change here
is the btree code itself -- no more ints there.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153648.27049-1-xemul@scylladb.com>
Before this patch, deserializing a collection from a (prepared) CQL request
involved deserializing every element and serializing it again. Originally this
was a hacky method of validation, and it was also needed to reserialize nested
frozen collections from the CQLv2 format to the CQLv3 format.
But since then we started doing validation separately (before calls to
from_serialized) and CQLv2 became irrelevant, making reserialization of
elements (which, among other things, involves a memory allocation for every
element) pure waste.
This patch adds a faster path for collections in the v3 format, which does not
involve linearizing or reserializing the elements (since v3 is the same as
our internal format).
After this patch, the path from prepared CQL statements to
atomic_cell_or_collection is almost completely linearization-free. The last
remaining place is collection_mutation_description, where map keys are
linearized.
Make to_range able to handle rvalues. We will pass managed_bytes&& to it
in the next patch to avoid pointless copying.
The public declaration of to_range is changed to a concrete function to avoid
having to explicitly instantiate to_range for all possible reference types of
clustering_key_prefix.
This patch switches the type used to store collection elements inside the
intermediate form used in lists::value, tuples::value etc. from bytes
to managed_bytes. After this patch, tuple and list elements are only linearized
in from_serialized, which will be corrected soon.
This commit introduces some additional copies in expression.cc, which
will be dealt with in a future commit.
We will need them to port the representation of collection types
in cql3/ from bytes to managed_bytes.
The version which takes an iterator of `bytes` as an argument will
be removed after that transition is complete.
We will use them to avoid linearization when going from the intermediate
std::vector<bytes> form in cql3/ to the atomic_cell format, by outputting
managed_bytes instead of bytes in get_with_protocol_version.
We will use it to port the representation of collections in cql3/
from bytes to managed_bytes.
The duplicate version for bytes_view will be removed after that transition
is complete.
The operation::make_*cell functions are useless aliases to methods of
update_parameters, and are used interchangeably with them throughout the code.
Remove them.
Also, remove the now-unused update_parameters::make_cell version for
fragmented_temporary_buffer::view.
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately the change involves
some nontrivial work, because raw_value must be viewable with raw_value_view,
and raw_value_view must accommodate both raw_value (that's where we store
values in prepared queries) and fragmented_temporary_buffer::view
(because that's the type of values coming from the wire).
This patch makes raw_value fragmented, by changing the backing type from
bytes to managed_bytes. raw_value_view is modified accordingly by changing
the backing type from fragmented_temporary_buffer::view to a variant of
fragmented_temporary_buffer::view and managed_bytes_view.
We have prepared the users of raw_value{_view} for this change in preceding
commits.
It's only used in a single test, and there is no reason why it should ever
be used anywhere else. So let's remove it from the public header and move
it to that test.
We want to change the internals of cql3::raw_value{_view}.
However, users of cql3::raw_value and cql3::raw_value_view often
use them by extracting the internal representation, which will be different
after the planned change.
This commit prepares us for the change by making all accesses to the value
inside cql3::raw_value(_view) be done through helper methods which don't expose
the internal representation publicly.
After this commit we are free to change the internal representation of
raw_value_{view} without messing up their users.
Currently, raw_value_view is backed by a fragmented_temporary_buffer::view,
and many users of this type use it by extracting that internal representation.
However, we want to change raw_value_view so that it can be created both
from fragmented_temporary_buffer and from managed_bytes, so that we can switch
the internals of raw_value from bytes to managed_bytes. To do that we need
to prepare all users for that more general representation.
This commit adds an API which allows using raw_value_view without accessing its
internal representation. In the next commits of this series we will switch all
callers who currently depend on that representation to the new API,
and then we will remove the old accessors and change the internals.
Just for convenience. We will use it in an upcoming patch where we switch
the inner representation of cql3::raw_value from bytes to managed_bytes, and we
will want to construct managed_bytes from fragmented_temporary_buffer::view.
We will need them to replace bytes with managed_bytes in some places in an
upcoming patch.
The change to configure.py is necessary because operator<< links to to_hex
in bytes.cc.
Now, the docstring comment next to can_send better represents the
condition that is checked inside that function. The statement about
returning true when destination left the NORMAL state is replaced with a
statement about returning true when the destination has left the ring.
Now, instead of looking at the gossiper state to check if the
destination node is still in the ring, we are using token_metadata as a
source of truth. This results in much simpler code in can_send() as
token_metadata has an is_member method which does exactly what we want.
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as follows:
1) the replacing node does not respond to echo messages, so other nodes
do not mark it as alive
2) the replacing node advertises the hibernate state so other nodes know
it is replacing
3) the replacing node responds to echo messages so other nodes can mark
it as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem, we can do the replacing operation in multiple stages.
One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416
1) the replacing node does not respond to echo messages
2) the replacing node advertises the prepare_replace state (remove the
replacing node from natural endpoints, but do not put it in the pending list yet)
3) the replacing node responds to echo messages
4) the replacing node advertises the hibernate state (put the replacing node
in the pending list)
Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can do the multiple stages without introducing a new
gossip status state.
This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.
Improvements:
1) It solves the race between marking replacing node alive and sending
writes to replacing node
2) The cluster reverts to a state before the replace operation
automatically in case of error. As a result, it solves the issue where a
replacing node that fails in the middle of the operation would stay in
HIBERNATE status forever.
3) The gossip status of the node to be replaced is not changed until the
replace operation is successful. HIBERNATE gossip status is not used
anymore.
4) Users can now pass a list of dead nodes to ignore explicitly.
Refs #8013
gossiper::advertise_to_nodes() is added to allow responding to gossip echo
messages for the specified nodes, using the current gossip generation number
for those nodes.
This helps avoid the restarted node being marked as alive during a pending
replace operation.
After this patch, when a node sends an echo message, the gossip
generation number is sent in the echo message. Since the generation
number changes after a restart, the receiver of the echo message can
compare the generation number to tell if the node has restarted.
Refs #8013
Check if a node is in NORMAL or SHUTDOWN status, which means that, from the
gossip point of view, the node is part of the token ring and either operates
in normal status or was in normal status but has been shut down.
Refs #8013
Previously, at the end of the removenode operation, endpoint lifecycle
subscribers are informed about the node being removed (the
"on_leave_cluster" method) before the token_metadata is updated to
reflect the fact that a node was removed.
Although no subscriber currently depends on token_metadata being
up-to-date when "on_leave_cluster" is called, the hints manager will
become sensitive to this in a later commit in this series.
This commit gets rid of the future problem by notifying subscribers
later, only after token_metadata is fully updated and replicated to
all shards.
Currently, the primitive_consumer parses all values in contiguous buffers.
A string of bytes may be very long, so parsing it in a single buffer
can cause a big allocation. This patch allows parsing into
fragmented_temporary_buffers instead of temporary_buffers.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
We are going to store sstable cells' values in fragmented_temporary_buffers.
This patch will allow checking these values with loggers.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
compound_type::serialize_value is currently implemented for arguments
of type 'bytes_view', 'managed_bytes', or 'managed_bytes_view'.
We will want to use it for a fragmented_temporary_buffer::view, so we
extend it for all FragmentedView types.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Tracing is created in two steps and is destroyed in two too.
The 2nd step doesn't have the corresponding stop part, so here
it is -- defer tracing stop after it was started.
But keep in mind that tracing is also shut down on
drain, so the stopping should handle this.
Fixes#8382
tests: unit(dev), manual(start-stop, aborted-start)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210331092221.1602-1-xemul@scylladb.com>
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that have been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.
Test: unit(dev)
Refs #8352
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#8360
* seastar 48376c76a...72e3baed9 (3):
> file: Add RFW_NOWAIT detection case for AuFS
> sharded: provide type info on no sharded instance exception
> iotune: Estimate accuarcy of measurement
Added missing include "database.hh" to api/lsa.cc since seastar::sharded<>
now needs full type information.
This is another patch in the series on removing big allocations from scylla.
The buffers in `compare_visitor` were replaced with `managed_bytes_view`; a similar change was also needed in tuple_deserializing_iterator and listlike_partial_deserializing_iterator, and was applied as well.
Tests: unit(dev)
Closes#8357
* github.com:scylladb/scylla:
types: remove linearization from abstract_type::compare
types: replace buffers in tuple_deserializing_iterator with fragmented ones
types: make tuple_type_impl::split work with any FragmentedViews
types: move read_collection_size/value specialization to header file
To avoid high latencies caused by large contiguous allocations
needed by linearizing, work on fragmented buffers instead.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
In preparation for removing linearization from abstract_type::compare,
add options to avoid linearization in tuple_deserializing_iterator.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
We may want to store a tuple in a fragmented buffer. To split it
into a vector of optional bytes, tuple_type_impl::split can be used.
To split a contiguous buffer (bytes_view), simply pass
single_fragmented_view(bytes_view).
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Addressed https://github.com/scylladb/scylla/pull/8314#issuecomment-803671234
(write issue: "Tracing: slow query fast mode documentation request")
Adds documentation for the fast slow-query tracing mode to docs/guide/tracing.md.
A patch to scylla-doc will be dup-ed after this one is merged.
cc @nyh
cc @vladzcloudius
Closes#8373
* github.com:scylladb/scylla:
tracing: api: fast mode doc improvement
tracing: fast slow query tracing mode docs
Currently the code is structured such that first the conditions required
for admission are checked. The success paths have early returns and if
all of them fail, we fall back to enqueueing the permit.
This patch restructures the code such that the wait conditions are
checked first, and if all of them fail, we fall back to admitting the
permit. This structure allows for easier introduction of additional
wait/admit conditions in the future.
Besides making the code more readable, this also enables restructuring
`do_wait_admission()`, without moving too much code around.
As a bonus, queue length is now only checked when the permit actually
has to be enqueued.
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.
Following the work started in 253a7640e, a new batch of old features is assumed to be always available. They are all still announced via gossip, but the code assumes that the feature is always true, because we only support upgrades from a previous release, and the release window is considerably smaller than 2 years.
Features picked this time via `git blame`, along with the date of their introduction:
* fe4afb1aa3 (Asias He 2018-09-05 14:52:10 +0800 109) static const sstring ROW_LEVEL_REPAIR = "ROW_LEVEL_REPAIR";
* ff5e541335 (Calle Wilund 2019-02-05 13:06:07 +0000 110) static const sstring TRUNCATION_TABLE = "TRUNCATION_TABLE";
* fefef7b9eb (Tomasz Grabiec 2019-03-05 19:08:07 +0100 111) static const sstring CORRECT_STATIC_COMPACT_IN_MC = "CORRECT_STATIC_COMPACT_IN_MC";
Tests: unit(dev)
Closes #8235
* github.com:scylladb/scylla:
sstables,test: remove variables depending on old features
gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
gms: make TRUNCATION_TABLE feature unconditionally true
gms: make ROW_LEVEL_REPAIR feature unconditionally true
repair: stop relying on ROW_LEVEL_REPAIR feature
Fixes #8369
This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.
We check the wrong variable in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev calls covering
more or less the whole file. Not intended.
Closes #8370
8a8589038c ("test: increase quota for tests to 6GB") increased
the quota for tests from 2GB to 6GB. I later found that the increased
requirement is related to the page size: Address Sanitizer allocates
at least a page per object, and so if the page size is larger the
memory requirement is also larger.
Make use of this by only increasing the quota if the page size
is greater than 4096 (I've only seen 4096 and 65536 in the wild).
This allows greater parallelism when the page size is small.
Closes #8371
Tests are short-lived and use a small amount of data. They
are also often run repeatedly, and the data is deleted immediately
after the test. This is a good scenario for using the kernel page
cache, as it can cache read-only data from test to test, and avoid
spilling write data to disk if it is deleted quickly.
Acknowledge this by using the new --kernel-page-cache option for
tests.
This is expected to help on large machines, where the disk can be
overloaded. Smaller machines with NVMe disks probably will not see
a difference.
Closes #8347
In order to maintain backward compatibility wrt. cluster features,
two boolean variables were kept in sstable writers:
- correctly_serialize_non_compound_range_tombstones
- correctly_serialize_static_compact_in_mc
Since these features are assumed to always be present now,
the above variables are no longer needed and can be purged.
This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.
Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.
Refs #4826
Closes #8313
* github.com:scylladb/scylla:
thrift: add support for max_concurrent_requests_per_shard
thrift: add metrics for admission control
thrift: add a counter for in-flight requests
thrift: add a counter for blocked requests
thrift: partially add admission control
service_permit: add a getter for the number of units held
thrift: coroutinize processing a request
memory_limiter: add a missing seastarx include
LCS reshape is currently inefficient for repair-based operation, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will be then integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.
A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.
For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.
This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2
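The formula above can be checked against the quoted log line. A sketch, assuming LCS's default fan-out of 10 and rounding the level down (the rounding policy is my assumption, not taken from the patch):

```python
import math

def ideal_reshape_level(total_size: int, max_sstable_size: int,
                        fan_out: int = 10) -> int:
    """log (base fan_out) of (total_size / max_sstable_size),
    i.e. the lowest level whose capacity fits the disjoint run."""
    if total_size <= max_sstable_size:
        return 0
    return int(math.log(total_size / max_sstable_size, fan_out))
```

With 256 sstables of the maximum size, `log10(256) ≈ 2.4`, so the run lands in level 2, matching the "into level 2" log line.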
For repair-based decommission/removenode though, which reshape wasn't
wired on yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on to are
leaked, as users should have released them. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.
Tests: unit(release, debug)
* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
reader_permit: signal leaked resources
test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.
Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
The command can be used to inspect IO queues of a local reactor.
Example output:
```
(gdb) scylla io-queues
Dev 0:
Class: |shares: |ptr:
--------------------------------------------------------------------------------
"default" |1 |(seastar::priority_class_data *)0x6000002c6500
"commitlog" |1000 |(seastar::priority_class_data *)0x6000003ad940
"memtable_flush" |1000 |(seastar::priority_class_data *)0x6000005cb300
"streaming" |200 |(seastar::priority_class_data *)0x0
"query" |1000 |(seastar::priority_class_data *)0x600000718580
"compaction" |1000 |(seastar::priority_class_data *)0x6000030ef0c0
Max request size: 2147483647
Max capacity: Ticket(weight: 4194303, size: 4194303)
Capacity tail: Ticket(weight: 73168384, size: 100561888)
Capacity head: Ticket(weight: 77360511, size: 104242143)
Resources executing: Ticket(weight: 2176, size: 514048)
Resources queued: Ticket(weight: 384, size: 98304)
Handles: (1)
Class 0x6000005d7278:
Ticket(weight: 128, size: 32768)
Ticket(weight: 128, size: 32768)
Ticket(weight: 128, size: 32768)
Pending in sink: (0)
```
Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.
Closes #8362
* github.com:scylladb/scylla:
scylla-gdb: add io-queues command
scylla-gdb.py: add parsing std::priority_queue
scylla-gdb.py: add parsing std::atomic
scylla-gdb.py: add parsing std::shared_ptr
scylla-gdb.py: add parsing intrusive_slist
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.
This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for consumers that return plain stop_iteration rather than future<stop_iteration>,
it reduces the amount of future machinery involved
Test: unit(dev, release, debug)
Combined reader microbenchmark has shown from 2% to 22% improvement in median
execution time while memtable microbenchmark has shown from 3.6% to 7.8%
improvement in median execution time.
Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 16
random seed: 3549335083
test iterations median mad min max
combined.one_row 1316234 140.120ns 0.020ns 140.074ns 140.141ns
combined.single_active 7332 91.484us 31.890ns 91.453us 91.778us
combined.many_overlapping 945 870.973us 429.720ns 868.625us 871.403us
combined.disjoint_interleaved 7102 85.989us 7.847ns 85.973us 85.997us
combined.disjoint_ranges 7129 85.570us 7.840ns 85.562us 85.596us
combined.overlapping_partitions_disjoint_rows 5458 124.787us 56.738ns 124.731us 125.370us
clustering_combined.ranges_generic 1920688 217.940ns 0.184ns 217.742ns 218.275ns
clustering_combined.ranges_specialized 1935318 194.610ns 0.199ns 194.210ns 195.228ns
memtable.one_partition_one_row 624001 1.600us 1.405ns 1.599us 1.605us
memtable.one_partition_many_rows 79551 12.555us 1.829ns 12.549us 12.558us
memtable.many_partitions_one_row 40557 24.748us 77.083ns 24.644us 25.135us
memtable.many_partitions_many_rows 3220 310.429us 57.628ns 310.295us 311.189us
```
After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 16
random seed: 3549335083
test iterations median mad min max
combined.one_row 1358839 109.222ns 0.122ns 109.089ns 109.348ns
combined.single_active 7525 87.305us 25.540ns 87.273us 87.362us
combined.many_overlapping 962 853.195us 1.904us 851.244us 855.142us
combined.disjoint_interleaved 7310 81.988us 28.877ns 81.949us 82.032us
combined.disjoint_ranges 7315 81.699us 37.144ns 81.662us 81.874us
combined.overlapping_partitions_disjoint_rows 5591 120.964us 15.294ns 120.949us 121.120us
clustering_combined.ranges_generic 1954722 211.993ns 0.052ns 211.883ns 212.084ns
clustering_combined.ranges_specialized 2042194 187.807ns 0.066ns 187.732ns 188.289ns
memtable.one_partition_one_row 648701 1.542us 0.339ns 1.542us 1.543us
memtable.one_partition_many_rows 85007 11.759us 1.168ns 11.752us 11.782us
memtable.many_partitions_one_row 43893 22.805us 17.147ns 22.782us 22.843us
memtable.many_partitions_many_rows 3441 290.220us 41.720ns 290.172us 290.306us
```
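The 2%–22% (combined) and 3.6%–7.8% (memtable) figures quoted above follow directly from the two median columns. A quick helper to reproduce them from the tables:

```python
def improvement_pct(before: float, after: float) -> float:
    """Relative improvement in median execution time, in percent."""
    return (before - after) / before * 100
```

For example, `combined.one_row` goes from 140.120 ns to 109.222 ns (about 22%), while `combined.many_overlapping` goes from 870.973 us to 853.195 us (about 2%).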
Closes #8359
* github.com:scylladb/scylla:
flat_mutation_reader: optimize consume_pausable for some consumers
flat_mutation_reader: special case consumers in consume_pausable
flat_mutation_reader: Change order of checks in consume_pausable
flat_mutation_reader: fix indentation in consume_pausable
flat_mutation_reader: Remove allocation in consume_pausable
perf: Add benchmarks for large partitions
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
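The hand-off trick described above can be modeled in a few lines. This is a toy Python sketch with illustrative names, not the real Scylla/Thrift API: the server stores the permit in a member, then synchronously invokes the handler, which reads it back and claims ownership. Correctness hinges entirely on there being no preemption point between the store and the read-back.

```python
class ThriftServer:
    """Toy model of passing a per-query permit through per-connection state."""

    def __init__(self):
        self.current_permit = None

    def serve(self, request, handler):
        self.current_permit = object()  # acquire a fresh service permit
        # No yield between here and the handler reading the permit back.
        return handler(request, self)

def handler(request, server):
    permit = server.current_permit   # claim ownership of the permit
    server.current_permit = None
    return request, permit
```

If a yield were ever introduced between `serve` storing the permit and the handler claiming it, two concurrent requests could race on `current_permit`, which is exactly the hazard the commit message warns about.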
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.
Fixes #8336
Closes #8338
* github.com:scylladb/scylla:
docs: mention disabling Thrift by default
db,config: disable Thrift by default
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.
* manmanson/raft_mappings_wiring_v12:
raft: update README.md with info on RPC server address mappings
raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
raft/fsm: add optional `rpc_configuration` field to fsm_output
raft: maintain current rpc context in `server_impl`
raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
raft: unit-tests for `raft_address_map`
raft: support expiring server address mappings for rpc module
Consumers that return stop_iteration rather than future<stop_iteration> don't
have to consume a single fragment per iteration of repeat; they can
consume the whole buffer in each iteration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it called futurize_invoke for
both. This patch special-cases consumers that return
future<stop_iteration> and doesn't call futurize_invoke for them, as that
is unnecessary work.
More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This way we can avoid checking is_buffer_empty twice.
The compiler might be able to optimize this out, but why depend on that
when the alternative is no less readable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
lambda. Tests seem to support the idea.
Indentation will be fixed in the following commit to make the review
easier.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.
Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.
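The fix is easiest to see against sample /proc/mounts contents. A sketch of the detection logic (the exact search expression lives in the dbuild script; this just demonstrates why the trailing space matters):

```python
def dbuild_detects_cgroup2(proc_mounts: str) -> bool:
    """The trailing space after the mount point means that
    'cgroup2 /sys/fs/cgroup/unified ...' no longer matches."""
    return "cgroup2 /sys/fs/cgroup " in proc_mounts
```

On Ubuntu 20.10's hybrid layout, the only cgroup2 line is for /sys/fs/cgroup/unified, so the pattern with the trailing space correctly reports "not cgroupv2".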
Closes #8355
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.
Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.
This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The start-stop code is drifting towards a straightforward
scheme of a bunch of
service foo;
foo.start();
auto stop_foo = defer([&foo] { foo.stop(); });
blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.
This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.
This unification also closes a bunch of not-yet-caught bugs where
parts of the system remained running in case shutdown happens
after nodetool drain. In this case the whole drain_on_shutdown()
becomes a noop (it just returns drain()'s future) and what's
missing in drain() goes missing on shutdown too.
tests: unit(dev), dtest(simple_boot_shutdown : dev),
manual(start+stop, start+drain+stop : dev)
refs: #2737
* xemul/br-drain-on-shutdown:
drain_on_shutdown: Simplify
drain: Fix indentation
storage_service: Unify drain and drain_on_shutdown
storage_proxy: Drain and unsubscribe in main.cc
migration_manager: Stop it in two phases
stream_manager: Stop instances on drain
batchlog_manager: Stop its instances on shutdown
tracing: Shutdown tracing in drain
tracing: Stop it in main.cc
system_distributed_keyspace: Stop it in main.cc
storage_service: Move (un)subscription to migration events
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.
It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).
This will be used later to update rpc module mappings
when cluster configuration changes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.
In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.
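The expirable-mapping idea above can be sketched as a small map with per-entry TTLs. This is a toy Python model with illustrative names, not the real `raft_address_map` interface: regular (configured) entries never expire, while entries learned from unknown peers carry a TTL and are dropped lazily on lookup.

```python
import time

class RaftAddressMap:
    """Toy sketch of expirable server address mappings."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # server id -> (address, expiry or None)

    def set(self, server_id, addr, expiring):
        # Addresses learned from unknown peers are kept only for the TTL.
        expiry = self._clock() + self._ttl if expiring else None
        self._entries[server_id] = (addr, expiry)

    def find(self, server_id):
        entry = self._entries.get(server_id)
        if entry is None:
            return None
        addr, expiry = entry
        if expiry is not None and self._clock() > expiry:
            del self._entries[server_id]  # lazily drop the expired mapping
            return None
        return addr
```

With such a map, a reply to a peer outside the current configuration works as long as its message arrived within the TTL window, while outgoing communication to a never-seen unconfigured peer still fails, matching the behavior described above.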
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, nor does it need logging before and after
as it's done by the deferred action from main that
calls it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.
Also the indentation is deliberately left broken for the
sake of patch readability.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently shutdown after drain leaves storage proxy
subscribed on storage_service events and without the
storage_proxy::drain_on_shutdown being called. So it
seems safe if the whole thing is relocated closer to
its starting peers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before the patch the migration manager was stopped in
two ways and one was buggy.
Plain shutdown -- it's just sharded::stop-ed by defer
in main(), but this happens long after the shutdown
of commitlog, which is not correct.
Shutdown after drain -- it's stopped twice, first time
right before the commitlog shutdown, second -- the
same defer in main(). And since sharded::stop is
re-entrant, the 2nd stop is a noop.
This patch splits the stop into two phases: first it
stops the instances and does this in _both_ -- plain
shutdown and shutdown after drain. This phase is done
before commitlog shutdown in both cases. Second, the
existing deferred sharded::stop in main.cc.
This change needs migration_manager::stop() to
become re-entrant, but that's easily ensured with the
help of the abort_source the migration_manager has.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's not seen directly from this patch itself, but the only
difference between first several calls that drain() makes
and the stop_transport() is the do_stop_stream_manager()
in the latter.
Again, it's partially a bugfix (shutdown after drain leaves
streaming running), partially a must-have thing (streaming
is not expected in the background after drain), partially a
unification of two drains out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now stopped (not sharded::stop(), but batchlog_manager::stop)
on plain drain, but plain shutdown leaves it running, so fill this
gap.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
First of all, a shutdown that happens after nodetool drain leaves
tracing up and running, so this is effectively a bugfix. But it is also
a step towards unified drain and drain_on_shutdown.
Keeping this bit in drain seems to be required because drain
stops transport, flushes column families and shuts commitlog
down. Any tracing activity happening after it looks uncalled
for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tracing::stop() just checks that it was shutdown()-ed and
is otherwise a noop, so it's OK to stop tracing later. This brings
drain() and drain_on_shutdown() closer to each other and makes
main.cc look more like it should.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now stopped in drain_on_shutdown, but since its stop()
method is a noop, it doesn't matter where it is. Keeping it
in main.cc next to the related start brings drain_on_shutdown()
closer to drain() and the whole thing closer to the ideal
start-stop sequence.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the patch the subscription effectively happens at
the same time as before, but is now located in main.cc,
so no real change here.
The unsubscription was in the drain_on_shutdown before
the patch; after it, it happens in a defer next
to its peer, i.e. later, but it shouldn't be disastrous
for two reasons. First -- client services and migration
manager are already stopped. Second -- before the patch
this subscription was _not_ cancelled if shutdown ran
after nodetool drain and it didn't cause troubles.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When destroying a permit with leaked resources we call
`on_internal_error_noexcept()` in the destructor. This method logs an
error or asserts depending on the configuration. When not asserting, we
need to return the leaked units to the semaphore, otherwise they will be
leaked for good. We can do this because we know exactly how many
resources the user of the permit leaked (never signalled).
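The fix described above amounts to a second step in the destructor. A toy Python sketch (illustrative names, with `close()` standing in for the C++ destructor): report the internal error as before, but then signal the leaked units back to the semaphore instead of losing them.

```python
class ReaderPermit:
    """Toy model of a permit that returns leaked units on destruction."""

    def __init__(self, semaphore, units):
        self.semaphore = semaphore
        self.units = units  # resources currently held, never signalled back

    def release(self, n):
        self.units -= n
        self.semaphore.signal(n)

    def close(self):  # stands in for the C++ destructor
        if self.units:
            self.semaphore.on_internal_error("permit leaked units")
            self.semaphore.signal(self.units)  # give leaked units back
            self.units = 0
```

This is possible precisely because the permit tracks exactly how many units its user never released, so the semaphore's resource count stays accurate and cannot dry up from leaks alone.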
Currently the reader copies and merges all range tombstones from all partition
versions (that match the given range, but still) when being initialized or
decides to refresh iterators. This is a lot of potentially useless work and
memory, as the reader may be dropped before it emits all the mutations from
the given range(s).
It's better to walk the tombstones step-by-step, like it's done for rows.
fixes: #1671
tests: unit(dev)
* xemul/br-partiion-snapshot-reader-on-demand-range-tombstones-2:
range_tombstone_stream: Remove unused methods
partition_snapshot_reader: Emit range tombstones on demand
partition_snapshot_reader: Introduce maybe_refresh_state
partition_snapshot_reader: Move range tombstone stream member
partition_snapshot_reader: Add reset_state method to helper class
partition_snapshot_reader: Downgrade heap comparator
partition_snapshot_reader: Use on-demand comparators
range_tombstone_list: Add new slice() helper
range_tombstone_list: Introduce iterator_range alias
Constructors of classes inherited from file_impl copy alignment
values by hand, but miss the overwrite one, so on a new file
it remains default-initialized.
To fix this, and to avoid forgetting to properly initialize future
fields of file_impl, use the impl's copy constructor.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210325104830.31923-1-xemul@scylladb.com>
Seven etcd unit tests as boost tests.
* alejo/raft-tests-etcd-08-v4-communicate-v5:
raft: etcd unit tests: test proposal handling scenarios
raft: etcd unit tests: test old messages ignored
raft: etcd unit tests: test single node precandidate
raft: etcd unit tests: test dueling precandidates
raft: etcd unit tests: test dueling candidates
raft: etcd unit tests: test cannot commit without new term
raft: etcd unit tests: test single node commit
raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
raft: etcd unit tests: fix test_progress_leader
raft: testing: log comparison helper functions
raft: testing: helper to make fsm candidate
raft: testing: expose log for test verification
raft: testing: use server_address_set
raft: testing: add prevote configuration
raft: testing: make become_follower() available for tests
TestProposal
For multiple scenarios, check proposal handling.
Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestSingleNodePreCandidate
Checks that a single node configuration with prevote enabled still
automatically elects the node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.
But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestCannotCommitWithoutNewTermEntry tests the entries cannot be
committed when leader changes, no new proposal comes in and ChangeTerm
proposal is filtered.
NOTE: this doesn't check committed but it's implicit for next round;
this could also use communicate() providing committed output map
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Port etcd TestSingleNodeCommit
In a single node configuration elect the node, add 2 entries and check
number of committed entries.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Make the implementation follow the original test more closely.
Use newer boost test helpers.
NOTE: in etcd it seems a leader's self progress is in PIPELINE state.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Two helper functions to compare logs. For now only index, term, and data
type are used; data content comparison does not seem necessary yet.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Current election_timeout() helper might bump the term twice.
It's convenient and less error prone to have a more fine grained helper
that stops right when candidate state is reached.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
One INSERT statement was unnecessary for the test, so delete it.
Another was necessary, so explain it.
Tests: cql-pytest/test_null on both Scylla and Cassandra
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8304
Test raft configuration changes:
a node with empty configuration, transitioning
to an entirely different cluster, transitioning
in presence of down nodes, leader change during
configuration change, stray replies, etc.
* scylla-dev/raft-empty-confchange-v5: (21 commits)
raft: (testing) stray replies from removed followers
raft: always return a non-zero configuration index from the log
raft: (testing) leader change during configuration change
raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
raft: (testing) test confchange {ABCDEF} -> {ABCGH}
raft: (testing) test confchange {ABC} -> {CDE}
raft: (testing) test confchange {AB} -> {CD}
raft: (testing) test confchange {A} -> {B}
raft: (testing) test a server with empty configuration
raft: (testing) introduce testing utilities
raft: (testing) simplify id allocation in test
raft: (testing) add select_leader() helper
raft: (testing) introduce communicate() helper
raft: (testing) style cleanup in raft_fsm_test
raft: (testing) fix bug in election_threshold
raft: minor style changes & comments
raft: do not assert when transitioning to empty config
raft: assert we never apply a snapshot over uncommitted entries (leader)
raft: improve tracing
raft: add fsm_output::empty() helper to aid testing
...
The template method needs to be specialized in each file that is
using it. To avoid rewriting the specialization into multiple files,
move it to the header file.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing a list of column families to be specified.
We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any number of entries of options 1 and 2, in any combination.
Fixes #4520
Closes #7864
LCS reshape is basically 'major compacting' level 0 until it contains fewer than
N sstables.
That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 sstables, there would be a WA of about 8.
This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.
Fixes#8345.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
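The back-of-the-envelope write amplification estimate above can be sketched as a tiny helper (an illustration of the commit's stated formula only, not Scylla's actual compaction code):

```cpp
#include <cassert>

// Write-amplification estimate from the message above: when reshape
// repeatedly "major compacts" L0 in groups of max_threshold sstables,
// each byte is rewritten roughly (initial / max_threshold) times.
constexpr int estimated_write_amplification(int initial_sstables, int max_threshold) {
    return initial_sstables / max_threshold;
}
```

With 256 initial sstables and max_threshold of 32, this reproduces the WA of about 8 quoted above.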
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.
Closes#8337
This open option tells seastar that the file in question
will be truncated to the needed size right away, and all
subsequent writes will happen within this size. This
hint turns off the append optimization in seastar, which is not
that cheap, and helps save a few CPU cycles.
The option was introduced in seastar by 8bec57bc.
tests: unit(dev), dtest(commitlog:
test_batch_commitlog,
test_periodic_commitlog,
test_commitlog_replay_on_startup)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
Add comments explaining the rationale from transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
Until clang figures things out with the now infamous
`-llvm -inline-threshold X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).
Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.
This was harmless to our sstable reader, which recognizes both as
belonging to the current clustering row fragment, and collects both
fine.
However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would have to be so large that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via the column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.
This was caught by one of our unit tests:
sstable_conforms_to_mutation_source_test
...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.
The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:
assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));
It compares mutations read from the same sstable using different
methods, slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.
After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker; both are
wrong!
They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so the binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during the binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.
Fixes#8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
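The ordering constraint described above can be illustrated with a minimal sketch. The atom_kind weights below are an assumption for illustration, not Scylla's actual atom representation: within one clustering row, the row tombstone's start bound must be written before the row marker cell.

```cpp
#include <cassert>

// Illustrative model of cell-name ordering within one clustering row:
// the row tombstone's start bound sorts before the row marker cell, so
// it must be written first. The weights are hypothetical.
enum class atom_kind : int { row_tombstone_start = -1, row_marker = 0 };

// true if atom a must be written before atom b
constexpr bool written_before(atom_kind a, atom_kind b) {
    return static_cast<int>(a) < static_cast<int>(b);
}
```

The bug above was writing the atoms in the opposite order, producing promoted index entries that violate this comparator.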
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.
For the former case -- the btree wrapper is introduced.
For the latter -- the compiler optimizes away too many important
bits, and walking the tree turns into a bunch of hard-coded
hacks and reinterpret-casts. Until a better solution is found,
just print the address of the tree root.
* xemul/br-gdb-btree-rows:
gdb: Show address of the row::_cells tree (or "empty" mark)
gdb: Add support for intrusive B tree
gdb: Use helper to get rows from mutation_partition
Currently clang optimizes out lots of critical stuff from the
compact radix tree. Until we find a way to walk the
tree in gdb, it's better to at least show where it is in
memory.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rows inside partition are now stored in an intrusive B-tree,
so here's the helper class that wraps this collection.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).
Fix by defining it in the internal namespace.
Closes#8268
The manifest manipulation commands stopped working with podman 3;
the containers-storage: prefix now throws errors.
Switch to `buildah manifest`; since we're building with buildah,
we might as well maintain the manifest with buildah as well.
Closes#8231
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustment).
This causes clang to complain that the switch statement doesn't handle that
case.
Fix by adding a no-op case.
Closes#8269
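The fix can be sketched as follows (an illustrative helper, not Scylla's actual crc code): a byte-reversal switch whose 1-byte case is an explicit no-op, so the compiler sees every size handled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Illustrative byte-reversal helper. The switch must include a no-op
// case for 1-byte objects, otherwise clang complains about the
// unhandled size. Uses gcc/clang byte-swap builtins.
inline uint64_t reverse_bytes(uint64_t v, size_t size) {
    switch (size) {
    case 1:
        return v;  // a single byte needs no adjustment
    case 2:
        return __builtin_bswap16(static_cast<uint16_t>(v));
    case 4:
        return __builtin_bswap32(static_cast<uint32_t>(v));
    case 8:
        return __builtin_bswap64(v);
    }
    return v;
}
```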
A seastar::smart_ptr is trivial to dereference manually, but adding
support for it to dereference_smart_ptr() is just as trivial, and avoids
the annoying (but brief) detour which is currently needed.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>
Concepts are easier to read and result in better error messages.
This change also tightens the constraint from "std::is_fundamental" to
"std::integral". The differences are floating point values, nullptr_t,
and void. The latter two are illegal/useless to write, and nobody uses
floating point values for list lengths, so everything still compiles.
Closes#8326
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.
Fixes#8336
This series implements leader stepdown extension. See patch 4 for
justification for its existence. First three patches either implement
cleanups to existing code that future patch will touch or fix bugs
that need to be fixed in order for stepdown test to work.
* 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev:
raft: add test for leader stepdown
raft: introduce leader stepdown procedure
raft: fix replication when leader is not part of current config
raft: do not update last election time if current leader is not a part of current configuration
raft: move log limiting semaphore into the leader state
Scylla suffers from aggressive compaction after a repair-based operation is initiated. That translates into bad latency and slowness for the operation itself.
This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, thus reducing the bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.
To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.
The solution takes advantage of the fact that there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.
NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work, as the infrastructure is introduced in this series.
Refs #5226.
* 'offstrategy_v7' of github.com:raphaelsc/scylla:
tests: Add unit test for off-strategy sstable compaction
table: Wire up off-strategy compaction on repair-based bootstrap and replace
table: extend add_sstable_and_update_cache() for off-strategy
sstables/compaction_manager: Add function to submit off-strategy work
table: Introduce off-strategy compaction on maintenance sstable set
table: change build_new_sstable_list() to accept other sstable sets
table: change non_staging_sstables() to filter out off-strategy sstables
table: Introduce maintenance sstable set
table: Wire compound sstable set
table: prepare make_reader_excluding_sstables() to work with compound sstable set
table: prepare discard_sstables() to work with compound sstable set
table: extract add_sstable() common code into a function
sstable_set: Introduce compound sstable set
reshape: STCS: preserve token contiguity when reshaping disjoint sstables
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of active configuration. Current code skips replication
in this case though. Fix it by always replicating in the leader state.
Since we use external failure detector instead of relying on empty
AppendRequests from a leader there can be a situation where a node
is no longer part of a certain raft group but is still alive (and also
may be part of other raft groups). In such case last election time
should not be updated even if the node is alive. It is the same as if
it would have stopped to send empty AppendRequests in original raft.
Since we switched the scylla-python3 build directory to tools/python3/build
on Jenkins, we no longer need the compat-python3 targets, so drop them.
Related scylladb/scylla-pkg#1554
Closes#8328
Due to a bad interaction of recent changes (913d970 and 4c8ab10), inactive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that, pre-admission,
the permits don't forward their resource cost to the semaphore, to
prevent them from possibly blocking their own admission later. However, this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource-based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile, this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there are too many of
them.
Fixes: #8258
Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
test: mutation_reader_test: add test for permit cleanup
test: querier_cache_test: add memory based cache eviction test
reader_permit: add inactive state
querier: insert(): account immediately evicted querier as resource based eviction
reader_concurrency_semaphore: fix clear_inactive_reads()
reader_concurrency_semaphore: make inactive_read_handle a weak reference
reader_concurrency_semaphore: make evict() noexcept
reader_concurrency_semaphore: update out-of-date comments
After commit 0bd201d3ca ("cql3: Skip indexed
column for CK restrictions") fixed issue #7888, the test
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
began passing, as expected. So we can remove its "xfail" label.
Refs #7888.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>
* seastar ea5e529f30...83339edb04 (21):
> cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc)
> Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov
Fixes#8327.
> file: Add option to refuse the append-challenged file
> Merge "Teach io-tester to work on block device" from Pavel E
> Merge "Cleanup files code" from Pavel E
> install-dependencies: Support rhel-8.3
> install-dependencies: Add some missing rh packages
> file, reactor: reinstate RWF_NOWAIT support
> file: Prevent fsxattr.fsx_extsize from overflow
> cmake: enable clang's -Wno-error=#warnings if supported
> cmake: harden seastar_supports_flag aginst inputs with spaces or #
> cmake: fix seastar_supports_flag failing after first invocation
> thread: Stop backtraces in main() on s390x architecture
> intent: Explicitly declare constructors for references
> test: file_io_test: parallel_overwrite: use testing::local_random_engine
> util: log-impl: rework log_buf::inserter_iterator
> rwlock: pass timeout parameter to get_units
> concepts: require lib support to enable concepts
> rpc: print more info on bad protocol magic
> seastar-addr2line: strip input line to restore multiline support
> log: skip on unknown nested mixing instead of stopping the logging
Ref #8327.
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnMapEntriesTest.java into our
our cql-pytest framework.
This test file checks various features of indexing (with secondary index)
individual entries of maps. All these tests pass on Cassandra, but fail on
Scylla because of issue #2962 - we do not yet support indexing of the content
of unfrozen collections. The failing test currently fail as soon as they
try to create the index, with the message:
"Cannot create secondary index on non-frozen collection or UDT column v".
Refs #2962.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>
Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid consumption of resources, namely file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything.
The mechanism is now cleaned up, and a simple exponential backoff replaces the previous constant backoff policy.
Fixes#8317
Tests: unit(dev), manual(see #8317 for a simple reproducer)
Closes#8318
* github.com:scylladb/scylla:
thrift: add exponential backoff for retries
thrift: fix and simplify retry logic
When a tuple value is serialized, we go through every element type and
use it to serialize element values. But an element type can be
reversed, which is artificially different from the type of the value
being read. This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.
Fixes#7902
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8316
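The fix described above can be sketched minimally. The names below (simple_type, unreversed, types_match) are illustrative stand-ins, not Scylla's actual abstract_type API:

```cpp
#include <cassert>
#include <string>

// Hedged sketch: before comparing a tuple element's declared type with
// the value's type, strip any "reversed" wrapper so reversed<int>
// compares equal to int.
struct simple_type {
    std::string name;
    const simple_type* underlying = nullptr;  // non-null when this is a reversed wrapper

    bool is_reversed() const { return underlying != nullptr; }
    const simple_type& unreversed() const { return is_reversed() ? *underlying : *this; }
};

inline bool types_match(const simple_type& element_type, const simple_type& value_type) {
    // Unreverse both sides before comparing.
    return element_type.unreversed().name == value_type.unreversed().name;
}

inline bool reversed_element_matches_value() {
    simple_type int_type{"int"};
    simple_type reversed_int{"reversed<int>", &int_type};
    return types_match(reversed_int, int_type);
}
```

Without the unreversing step, the comparison would report a mismatch and trigger the server error described above.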
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns. But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key. We end up
with invalid clustering slice, which can cause problems downstream.
Fix this by skipping the indexed column when assembling the clustering
restrictions.
Tests: unit (dev)
Fixes#7888
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8320
A trichotomic comparator returning an int can easily be mistaken
for a less comparator, as the return types are convertible.
Use the new std::strong_ordering instead.
A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.
Ref #1449.
Test: unit (dev)
Closes#8323
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.
Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.
Closes#8322
time_point_to_string ensures its input is a time_point with
millisecond resolution (though it neglects to verify the epoch
is what it expects). Change the test from a clunky enable_if to
a nicer concept.
Closes#8321
Currently send_snapshot is the only two-way RPC used by Raft.
However, the sender (the leader) does not look at the receiver's
reply, other than checking that it's not an error. This has the following
issues:
- if the follower has a newer term and rejects the snapshot for
that reason, the leader will not learn about a newer follower
term and will not step down
- the send_snapshot message doesn't pass through a single-endpoint
fsm::step() and thus may not follow the general Raft rules
which apply for all messages.
- making a general purpose transport that simply calls fsm::step()
for every message becomes impossible.
Fix it by actually responding with snapshot_reply to send_snapshot
RPC, generating this reply in fsm::step() on the follower,
and feeding it into fsm::step() on the leader.
* scylla-dev/raft-send-snapshot-v2:
raft: pass snapshot_reply into fsm::step()
raft: respond with snapshot_reply to send_snapshot RPC
raft: set follower's next_idx when switching to SNAPSHOT mode
raft: set the current leader upon getting InstallSnapshot
The original backoff mechanism which just retries after 1ms
may still lead to rapid resource depletion.
Instead, an exponential backoff is used, with a cap of ~2s.
Tests: manual, with cassandra-stress and browsing logs
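The exponential backoff can be sketched as below. The base delay and the exact cap are assumptions for illustration; the commit only states that the delay grows exponentially up to roughly 2 seconds:

```cpp
#include <cassert>
#include <chrono>
#include <algorithm>

// Illustrative backoff: delay doubles with each retry, capped at ~2s.
inline std::chrono::milliseconds backoff_for_retry(unsigned retry) {
    using namespace std::chrono;
    constexpr milliseconds base{1};     // assumed initial delay
    constexpr milliseconds cap{2048};   // ~2s cap
    // Clamp the shift so large retry counts can't overflow.
    unsigned shift = std::min(retry, 11u);
    return std::min(milliseconds{base.count() << shift}, cap);
}
```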
The retry logic for Thrift frontend had two bugs:
1. Due to a missing break in a switch statement,
two retry calls were always performed instead of one,
which acts a little bit like a Seastar forkbomb.
2. The delayed action was not guarded by any gate,
so it was theoretically possible to access a captured `this`
pointer of an object which had already been deallocated.
In order to fix the above, the logic is simplified to always
retry with backoff - it makes very little sense to skip the backoff,
and immediate retries are not needed by anyone, while they cause
a severe overload risk.
Tests: manual - a simple cassandra-stress invocation was able to crash
scylla with a segfault:
$ cassandra-stress write -mode thrift -rate threads=2000
Fixes#8317
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.
A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.
Closes#8315
* github.com:scylladb/scylla:
sstables: vector write: convert to concepts
sstables: check_truncated_and_assign: convert to concept
sstables: convert write() to concepts
sstables: convert write_vint() to concepts
sstables: vector parse(): convert to concept
sstables: convert parse() for a self-describing type to concept
sstables: read_vint(): convert enable_if to concepts
sstables: add concept for self-describing type
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.
There are three variants: integral, enum, and self-describing
(currently expressed as not integral and not enum). Convert to
concepts by using the standard concepts or the new self_describing
concept.
By default the boto3 library waits up to 60 seconds for a response,
and if it got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times,
thus slowing down failures, so in our test configuration we lowered the
number of retries to 3, but the setting of a 60-second timeout plus
3 retries still causes two problems:
1. When the test machine and the build are extremely slow, and the
operation is long (usually, CreateTable or DeleteTable involving
multiple views), the 60 second timeout might not be enough.
2. If the timeout is reached, boto3 silently retries the same operation.
This retry may fail because the previous one really succeeded at
least partially! The symptom is tests which report an error when
creating a table which already exists, or deleting a table which
doesn't exist.
The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).
Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some weird errors caused by retrying an
operation which was already almost done.
Fixes#8135
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
The two vector parse() overloads select between integral members
and non-integral members. Use std::integral to constrain the
integral overload and leave the other unconstrained; C++ will choose
the more constrained version when it applies.
This parse() overload uses "not integral and not enum" to reject
non-self-describing types. Express it directly with the self_describing
concept instead.
Convert read_vint() to a concept. The explicitly deleted version
is no longer needed since wrongly-typed inputs will be rejected
by the constraint. Similarly the static assert can be dropped
for the same reason.
Our sstable parsing and writing code contains a self-describing
type concept, where a type can advertise its members via a
describe_type() member function with a specific protocol.
Formalize that into a C++ concept. This is a little tricky, since
describe_type() accepts a parameter that is itself a template, and
requires clauses only work with concrete types. To handle this problem,
create such a concrete example type and use it in the concept.
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.
We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set and
their existence aren't accounted by the compaction backlog tracker yet.
Refs #5226.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new variant will allow its caller to submit off-strategy job
asynchronously on behalf of a given table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in the maintenance set is disjoint, so
the effect on read amplification is avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.
Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent a sstable in this set from being added or removed from
the tracker.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, _sstables becomes the compound set, and _main_sstables refers
only to the main sstables of the table. In the near future, the maintenance
set will be introduced and will also be managed by the compound set.
So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.
By storing the compound set in _sstables, functions which used _sstables for
creating readers, computing statistics, etc., will not have to be changed
when we introduce the maintenance set, so the code change is greatly
minimized by this approach.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that a sstable set may not want to be backlog tracked.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new sstable set implementation is useful for combining operation of
multiple sstable sets, which can still be referenced individually via
its shared ptr reference.
It will be used when maintenance set is introduced in table, so a
compound set is required to allow both sets to have their operations
efficiently combined.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When reshaping hundreds of disjoint sstables, like on bootstrap,
contiguity wasn't being preserved because the heuristic for picking
candidates didn't take into account their token range, which resulted
in reshape messing with the contiguity that could otherwise be
preserved by respecting the token order of the disjoint sstables.
In other words, sstables with the smallest first tokens should be
compacted first. By doing that, the contiguity is preserved even
across size tiers, after reshape has completed its possible multiple
rounds to get all the data in shape.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reshape's relaxed mode, used during initialization, tolerates only min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate up to the maximum of
max_threshold and 32 sstables.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.
Refs #8297.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
By the time we receive snapshot_reply from a follower
we may no longer be the leader. Follower term may be
different from snapshot term, e.g. the follower may
be aware of a new leader already and have a higher term.
We should pass this information into (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call FSM directly.
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.
Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
thus ensure that FSM testing results produced when using a test
transport are representative of the real world uses of
raft::rpc.
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.
The change helps reduce protocol state managed outside FSM.
If tracing::tracing::_ignore_trace_events is enabled then
the tracing system must ignore all session events
for non-full_tracing sessions (probabilistic tracing and
user-requested) and must not create subsessions with
make_trace_info.
This patch introduces a fast mode for slow query tracing that
omits all events during tracing.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
This state will be used for permits that are not in admitted state when
registered as inactive. We can have such reads if a read can be served
entirely from cache/memtables and it doesn't have to go to disk and
hence doesn't go through admission. These permits currently don't
forward their cost to the semaphore, so that they cannot block their
own admission and create a deadlock. However, when in the inactive
state, we do want to keep tabs on their resource consumption so we
don't accumulate too many of these inactive reads. So introduce a new
state for these non-admitted inactive reads. When entering the
inactive state, the permit registers its cost with the semaphore, and
when unregistered as inactive, it retracts it. This is a workaround
(khm, hack) until #4758 is solved and all permits are admitted on
creation.
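The bookkeeping can be sketched minimally (assumed names; the real logic lives inside reader_concurrency_semaphore and is tied to permit state transitions):

```cpp
#include <cstddef>

// Hypothetical, simplified accounting for non-admitted inactive
// permits: their cost is registered on entering the inactive state
// and retracted on leaving it, so the semaphore can see how much
// memory inactive reads hold and evict them under pressure.
class inactive_accounting {
    size_t _inactive_cost = 0;
public:
    // Called when a non-admitted permit is registered as inactive.
    void register_inactive(size_t cost) { _inactive_cost += cost; }
    // Called when the permit is unregistered as inactive.
    void unregister_inactive(size_t cost) { _inactive_cost -= cost; }
    size_t inactive_cost() const { return _inactive_cost; }
};
```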
`reader_concurrency_semaphore::register_inactive_read()` drops the
registered inactive read immediately if there is a resource shortage.
This is in effect a resource based eviction, so account it as such in
`querier::insert()`.
Broken by the move to an intrusive container (9cbbf40), which caused
said method to only clear the container but not destroy the inactive
reads contained therein. This patch restores the previous behaviour and
also adds a call to the destructor (to ensure inactive reads are cleaned up
under any circumstances), as well as a unit test.
Having the handle keep an owning reference to the inactive read led to
awkward situations, where the inactive read is destroyed during eviction
in certain situations only (querier cache) and not in other cases.
Although the users didn't notice anything from this, it led to very
brittle code inside the reader concurrency semaphore. Among others, the
inactive read destructor had to be open-coded in evict(), which already
led to mistakes.
This patch goes back to the weak pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still kept
in an intrusive list in the semaphore but the handle now keeps a weak
pointer to them. When destroyed, the handle will destroy the inactive
read if it is still alive. When evicting the inactive read, it will
set the pointer in the handle to null.
As #1449 notes, trichotomic comparators returning int are dangerous as they
can be mistaken for less comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.
Closes #8225
* github.com:scylladb/scylla:
dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
pager: rephrase misleading comparison check
test: total_order_checks: prepare for std::strong_ordering
test: mutation_test: prepare merge_container for std::strong_ordering
intrusive_array: prepare for std::strong_ordering
utils: collection-concepts: prepare for std::strong_ordering
Convert tri_comparators to return std::strong_ordering rather than int,
to prevent confusion with less comparators. Downstream users are either
also converted, or adjust the return type back to int, whichever happens
to be simpler; in all cases the change is trivial.
We check !result_of_tri_compare, which makes it look like we're
checking a boolean predicate, whereas we're really checking for
equality. Change to result_of_tri_compare == 0, which is less likely
to be confusing, and is also compatible with std::strong_ordering.
Adjust the total_order_check template to work with comparators
returning either int (as a temporary compatibility measure) or
std::strong_ordering (for #1449 safety).
The function merge_container() accepts a trichotomic comparator returning
an int. As #1449 explains, this is dangerous as it could be mistaken for
a less comparator. Switch to std::strong_ordering, but leave a compatible
merge_container() in place as it is still needed (even after this series).
collection-concepts includes a Comparable concept for a trichotomic
comparator function, used in intrusive btree and double_decker. Prepare
for std::strong_ordering by also allowing std::strong_ordering as a
return type. Once we've cleaned the code base, we can tighten it to
only allow std::strong_ordering.
This set removes a few more calls to the global storage service and
prevents more of them from happening in thrift, which is about to
start using the memory limiter semaphore too.
The set turns this semaphore into a sharded one living in the scope of
main(), makes others use the local instance and removes the no longer
needed bits from storage service.
tests: unit(dev)
branch: https://github.com/xemul/scylla/commits/br-global-memory-limiter-sem
* xemul_drop_memory_limiter:
storage_service: Drop memory limiter
memory_limiter: Use main-local instance everywhere
main: Have local memory limiter and carry where needed
memory_limiter: Encapsulate memory limiting facility
cql_server: Remove semaphore getter fn from config
The cql_server and alternator both need the limiter, so
patch them to stop using storage service's one and use
the main-local one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Prepare memory limiters to have non-global instance of
the service. For now the main-local instance is not
used and (!) is not stopped for real, just like the
storage_service's one is.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service carries a semaphore and a size_t value
to facilitate memory limiting for client services.
This patch encapsulates both fields in a separate helper
class that can be used by whoever needs it without
messing with the storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
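A minimal sketch of the encapsulation (assumed shape; the real class wraps a seastar semaphore rather than plain counters):

```cpp
#include <cstddef>

// Hypothetical memory_limiter: bundles the total limit and the
// remaining units behind one helper, so client services (cql_server,
// alternator, thrift) no longer reach into storage_service for them.
class memory_limiter {
    size_t _total;
    size_t _available;
public:
    explicit memory_limiter(size_t total) : _total(total), _available(total) {}
    size_t total_memory() const { return _total; }
    // Try to reserve `units` of memory for a client request.
    bool try_consume(size_t units) {
        if (units > _available) {
            return false;
        }
        _available -= units;
        return true;
    }
    void release(size_t units) { _available += units; }
};
```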
The cql_server() needs to get the memory limiter semaphore
from the local storage service instance. To make this happen
a callback is introduced on the config structure. The same
can be achieved in a simpler manner -- by providing the
local storage service instance directly.
Actually, the storage service will be removed in further
patches from this place, so this patch is mostly to get
rid of the callback from the config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The slowest test in test_streams.py is test_list_streams_paged. It is meant
to test the ListStreams operation with paging. The existing test repeated
its test four times, for four different stream types. However, there is
no reason to suspect that the ListStreams operation might somehow be
different for the four stream types... We already have other tests which
create streams of the four types, and use these streams - we don't
need the test for ListStreams to also test creating the four types.
By doing this test just once, not four times, we can save around 1.5
seconds of test time.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318073755.1784349-1-nyh@scylladb.com>
In the test test_tracing.py::test_tracing_all, we do some operations and
then need to wait until they appear in the tracing table.
The current code used an exponentially-increasing delay during this wait,
starting with 0.1 seconds and then doubling the delay until we find what
we're looking for.
However, it turns out that the delay until the data appears in the table
is deliberately chosen by Scylla - and is always around 2 seconds.
In this case, an exponential delay is really bad - we will usually wait
for around 1 second too long after the needed wait of 2 seconds.
So in this patch we replace the exponential delay by a constant delay -
we wait 0.3 seconds between each retry.
This change makes the test test_tracing.py::test_tracing_all finish
in a little over 2 seconds, instead of a little over 3 seconds
before this patch. We cannot reduce this 2 second time any further
unless we make the 2-second tracing delay configurable.
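The arithmetic behind this choice can be sketched (using the numbers from the text: doubling from 0.1s vs a constant 0.3s, with data ready at ~2s):

```cpp
// Returns the elapsed time of the first poll that happens at or
// after t_ready, given the initial poll interval and backoff policy.
double first_successful_poll(double t_ready, double interval, bool doubling) {
    double t = 0;
    while (t < t_ready) {
        t += interval;
        if (doubling) {
            interval *= 2;  // exponential backoff
        }
    }
    return t;
}
```

With doubling delays the polls land at 0.1s, 0.3s, 0.7s, 1.5s, 3.1s -- so data ready at 2s is noticed over a second late; with a constant 0.3s interval the overshoot is at most 0.3s.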
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318000040.1782933-1-nyh@scylladb.com>
The test test_table.py::test_table_streams_on creates tables with various
stream types, and then immediately deletes them without testing anything.
This is a slow test (taking almost a full second on my laptop), and is
redundant because in test_streams.py we have tests which create tables
with streams in the same way - but then actually test that things work
with these streams. So this test might as well be removed, and this is
what we do in this patch.
Removing this test shaves another second from the Alternator test suite's
run time.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317230530.1780849-1-nyh@scylladb.com>
The test
test_condition_expression.py::test_condition_expression_with_forbidden_rmw
takes half a second to run (dev build, on my laptop), one of the slowest
tests in Alternator's test suite. Part of the reason was that it needlessly
set the same table to forbidden_rmw, multiple times.
Instead of doing that, we switch to using the test_table_s_forbid_rmw
fixture, which is a table like test_table_s but created just once in
forbid_rmw mode.
The result is a faster test (0.05 seconds instead of 0.5 seconds), but
also safer if we ever want to run tests in parallel. It also fixes a
bug in the test: At the end of the test, we intended to double-check
that although the forbid_rmw table forbids read-modify-write operations,
it does allow pure writes. Yet the test did this after clearing the
forbid_rmw mode... So after this patch the test verifies this on the
forbid_rmw table, as intended.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317222703.1779992-1-nyh@scylladb.com>
The test
test_condition_expression.py::test_condition_expression_with_permissive_write_isolation
Currently takes (on my laptop, dev build) a full two seconds, one of
the slowest tests. It is not surprising it is slow - it runs five other
tests three times each (for three different write isolation modes),
but it doesn't have to be this slow. Before this patch, for each of
the five tests we switch the write isolation mode three times, and
these switches involve schema changes and are fairly slow. So in
this patch we reverse the loop - and switch the write isolation mode
to the outer loop.
This patch halves the runtime of this test - from two seconds to one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317221045.1779329-1-nyh@scylladb.com>
This series makes configure.py output slightly more helpful in case of incorrect parameters passed to the compiler/linker.
Closes #8267
* github.com:scylladb/scylla:
configure: print more context if the linking attempt failed
configure: provide more context on failed ./configure.py run
configure: add verbose option to try_compile_and_link
We should run install-dependencies.sh with -e option to prevent ignoring
error in the script.
Also, we need to add tools/jmx/install-dependencies.sh and
tools/java/install-dependencies.sh, to fix 'No such file or directory'
error on them.
Fixes #8293
Closes #8294
[avi: did not regenerate toolchain image, since no new packages are
installed]
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.
The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.
Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
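The pattern and its replacement, as a minimal sketch:

```cpp
#include <cstdint>
#include <cstring>

// UB: reinterpreting the bytes of `buf` through an unrelated pointer
// type violates strict aliasing (this is what the removed
// reinterpret_cast<const net::packed<T>*> pattern amounted to):
//   uint32_t v = *reinterpret_cast<const uint32_t*>(buf);

// Well-defined: copy the bytes into a fresh object. Compilers
// recognize this pattern and emit a single load, so there is no
// performance penalty.
uint32_t load_u32(const char* buf) {
    uint32_t v;
    std::memcpy(&v, buf, sizeof v);
    return v;
}
```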
Closes #8241
* github.com:scylladb/scylla:
treewide: get rid of unaligned_cast
treewide: get rid of incorrect reinterpret casts
The existing implementation wrongfully shares _all sstables
rather than cloning it. This caused a use-after-free
in `repair_meta::do_estimate_partitions_on_local_shard`
when traversing a shared sstable_set, during which
`table::make_reader_excluding_sstables` erased an entry.
The erase should have happened on a cloned copy
of the sstable_list, not on a shared copy.
The regression was introduced in
c3b8757fa1.
Added a unit test that reproduces the share-on-copy issue
for partitioned_stable_set (sstables::sstable_set).
Fixes #8274
Test: unit(release, debug)
DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>
This series adds slow query logging capability to alternator. Queries which last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced.
In order to be better prepared for https://github.com/scylladb/scylla/issues/2572, this series also expands the tracing API to allow custom key-value params and adds a custom `alternator_op` parameter to the slow node log. This information can also be deduced from the tracing session id by consulting the system_traces.events table, but the assumption of https://github.com/scylladb/scylla/issues/2572 is that this tracing might not always be available in the future.
This series comes with a simple test case which checks if operation logs indeed end up in `system_traces.node_slow_log`.
Tests:
unit(dev, alternator pytest)
manual: verified that no operations are logged if slow query logging is disabled; verified that operations that take less time than the threshold are not logged; verified with test_batch.py::test_batch_write_item_large that a large-enough operation is indeed logged and traced.
Fixes #8292
Example trace:
```cql
cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;
parameters | duration
---------------------------------------------------------------------------------------------+----------
{'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} | 75732
```
Closes #8298
* github.com:scylladb/scylla:
alternator: add test for slow query logging
alternator: allow enabling slow query logging
tracing: allow providing a custom session record param
"
All compaction types can now be stopped with the nodetool stop
command, example: nodetool stop SCRUB
Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB,
INDEX_BUILD, RESHARD, UPGRADE, RESHAPE.
"
* 'stop_compaction_types_v2' of github.com:raphaelsc/scylla:
compaction: Allow all supported compaction types to be stopped
compaction: introduce function to map compaction name to respective type
compaction: refactor mapping of compaction type to string
compaction: move compaction_name() out of line
index_entry.hh (the current home of `promoted_index_blocks_reader`) is
included in `sstables.hh` and thus in half our code-base. All that code
really doesn't need the definition of the promoted index blocks reader
which also pulls in the sstables parser mechanism. Move it into its own
header and only include it where it is actually needed: the promoted
index cursor implementations.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>
If the current leader is set, the follower will not vote
for another candidate. This is also known as "sticky leadership" rule.
Before this change, the rule was enacted only upon receiving
AppendEntries RPC from the leader. Turn it on also upon receiving
InstallSnapshot RPC.
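A minimal sketch of the rule (assumed types, not the actual fsm code):

```cpp
#include <optional>

// "Sticky leadership": a follower that believes a leader is alive
// refuses to vote for other candidates. After this change the current
// leader is learned from InstallSnapshot as well as AppendEntries.
struct follower_state {
    std::optional<int> current_leader;

    bool may_grant_vote() const {
        // Refuse to vote while we believe a leader exists.
        return !current_leader.has_value();
    }

    // Both leader-originated RPCs now record the current leader.
    void on_append_entries(int leader) { current_leader = leader; }
    void on_install_snapshot(int leader) { current_leader = leader; }
};
```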
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.
The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.
Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
The test checks whether slow queries are properly logged
in the system_traces.node_slow_log system table.
The test is deterministic because it uses the threshold of 0ms
to qualify a query as slow, which effectively makes all queries
"slow enough".
The mechanism of session record params is currently only used
to store query strings and a couple more params like consistency level,
but since we now have more frontends than just CQL and Thrift,
it would be nice to also allow the users to put custom parameters in
there.
An immediate first user of this mechanism would be alternator,
which is going to put the operation type under the "alternator_op" key.
The operation type is not part of the query string due to how DynamoDB's
protocol works - the op type is stored separately in the HTTP header.
While it's possible to extract the operation type from the session_id,
it might not be the case once #2572 is implemented.
As the existing comment explains, a progress can be deleted at the
point of logging. The logging should only be done if the progress
still exists.
Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
Previously, we crashed when the IN marker is bound to null. Throw
invalid_request_exception instead.
Fixes #8265
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8287
... `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz
With this change, the coordinator prefers itself as the "counter leader", so if
another endpoint is chosen as the leader, we know that the coordinator was not a
member of the replica set. With this guarantee we can increment the
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing a different leader (that metric used to neglect counter
updates).
The motivation for this change is to have a more reliable way of counting
non-token-aware queries.
Fixes #4337
Closes #8282
* github.com:scylladb/scylla:
storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
counters: Favor coordinator as leader
Refs #7794
If we need to pre-fill a segment file in O_DSYNC mode, we should
drop this flag for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
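The sequence can be sketched in plain POSIX terms (an assumed simplification; the real code goes through seastar's asynchronous file API):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Pre-fill a commitlog-style segment without O_DSYNC, so the fill
// doesn't pay a flush per write, then reopen with O_DSYNC for normal
// durable operation. Returns the durable fd, or -1 on error.
int prefill_then_reopen_dsync(const char* path, size_t size) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600); // no O_DSYNC
    if (fd < 0) {
        return -1;
    }
    char zeros[4096];
    std::memset(zeros, 0, sizeof zeros);
    for (size_t off = 0; off < size; off += sizeof zeros) {
        if (write(fd, zeros, sizeof zeros) < 0) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return open(path, O_WRONLY | O_DSYNC); // back to "normal" mode
}
```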
Closes #8250
* github.com:scylladb/scylla:
commitlog: Make pre-allocation drop O_DSYNC while pre-filling
commitlog: coroutinize allocate_segment_ex
Before this change, we would print an expression like this:
((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)
Now, we print the same expression like this:
(c = 0000007b)
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8285
There are a bunch of such calls in schema altering statements and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework is needed.
The solution in this set is -- all statements' execute() methods are
called with query processor as first argument (now the storage proxy
is there), query processor references and provides migration manager
for statements. Those statements that need proxy can already get it
from the query processor.
Afterwards table_helper and thrift code can also stop using the global
migration manager instance, since they both have query processor in
needed places. While patching them a couple of calls to global storage
proxy also go away.
The new query processor -> migration manager dependency fits into
current start-stop sequence: the migration manager is started early,
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager() it will _not_
call the query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.
Another option could be to make storage proxy reference migration
manager, but this dependency doesn't look correct -- migration manager
is higher-level service than the storage proxy is, it is migration
manager who currently calls storage proxy, but not the vice versa.
* xemul/br-kill-some-migration-managers-2:
cql3: Get database directly from query processor
thrift: Use query_processor::get_migration_manager()
table_helper: Use query_processor::get_migration_manager()
cql3: Use query_processor::get_migration_manager() (lambda captures cases)
cql3: Use query_processor::get_migration_manager() (alter_type statement)
cql3: Use query_processor::get_migration_manager() (trivial cases)
query_processor: Keep migration manager onboard
cql3: Pass query processor to announce_migration:s
cql3: Switch to qp (almost) in schema-altering-stmt
cql3: Change execute()'s 1st arg to query_processor
The compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
Without this adjustment, selection of runs would fail because the
function would try to unconditionally retrieve a run which may
live somewhere else.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
after compound set is introduced, select_sstable_runs() will no longer
work because the sstable runs live in sstable_set, but they should
actually live in the sstable_set being written to.
Given that runs are a concept belonging only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() to it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
When internode_encryption is "rack" or "dc", we should enforce that incoming
connections are from the appropriate address spaces iff answering on a
non-tls socket.
This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask the
snitch if the remote address is kosher, and refuse the connection otherwise.
Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"
Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with a dc/rack tls setting, we should warn them
that someone could potentially spoof their way past the
authentication.
Closes #8051
SLES support is basically already done in f20736d93d, but that was for the offline installer.
This fixes a few more problems with installing our rpm on SLES.
After this change, we can install our rpm for both CentOS/RHEL and SLES from a single image, like the unified deb.
SLES uses its own package manager, 'zypper', but it does support yum repositories, so no changes are required for the repo.
Closes #8277
* github.com:scylladb/scylla:
scylla_coredump_setup: support SLES
scylla_setup: use rpm to check package availability for SLES
dist: install optional packages for SLES
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.
The patch set also changes several other things in order
for tickers to work:
1. A bug in `raft_sys_table_storage` which caused an exception
if `raft::server::start` is called without any persisted state.
2. `raft_services::add_server` now automatically calls
`raft::server::start()` since a server instance should be started
before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
artificial restriction which is now lifted.
4. Raft schema state machine now returns a ready future instead of
throwing "not implemented" exception in `abort()`.
* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
raft/raft_services: provide a ticker for each raft server
raft/raft_services: switch from plain `throw` to `on_internal_error`
raft/raft_services: start server instance automatically in `add_server`
raft: return ready future instead of throwing in schema_raft_state_machine
raft: allow raft server to start with initial term 0
raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.
Fixes#8234.
Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)
Closes #8281
* github.com:scylladb/scylla:
test: logalloc_test: harden background reclain test against cpu overcommit
logalloc: background reclaim: use default scheduling group for adjusting shares
logalloc: background reclaim: log shares adjustment under trace level
logalloc: background reclaim: fix shares not updated by periodic timer
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.
Note that the timer should start after the server calls
its `start()` method, because otherwise it would crash
since fsm is not initialized yet.
Currently, the tick interval is hardcoded to be 100ms.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. raft protocol state machine.
Otherwise, it will likely result in a crash.
Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.
In case some exception happens inside `add_server`,
the `init` function will de-initialize what it has already
initialized, i.e. raft rpc verbs. This is important
since otherwise it would break the further initialization
process and, more importantly, would prevent raft
rpc verb deinitialization. That would cause a crash in
the `messaging_service` uninit procedure, because raft rpc
handlers would still be initialized.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The current implementation throws an exception, which will cause
a crash when stopping scylla. This will be used in the next patch.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.
This restriction is completely artificial and can be lifted
without any problems, which will be described below.
The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).
This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.
Vote and term can change independently of each other, so
checking only the term further obscures what is happening
and why.
In either case term will never be 0, because:
1. If the term has changed, then it's naturally greater than 0,
since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
a vote request message. In such case we have already updated
our term to the requester's term.
Switch to using an explicit optional in `fsm_output` so that
a reader doesn't have to think about the motivation behind this `if`
and can just check that the `term_and_vote` optional is engaged.
Given the motivation described above, the corresponding
assert(_fsm->get_current_term() != term_t(0));
in `server_impl::start` is removed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
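The change can be sketched as follows (assumed, simplified field shapes):

```cpp
#include <cstdint>
#include <optional>
#include <utility>

using term_t = uint64_t;
using server_id = int;

// Before: "term != 0" doubled as the need-to-persist flag, which both
// obscured the intent and artificially forbade starting at term 0.
// After: an engaged optional says explicitly that term and vote
// changed and must be persisted -- regardless of the term's value.
struct fsm_output {
    std::optional<std::pair<term_t, server_id>> term_and_vote;
};

bool needs_persisting(const fsm_output& out) {
    return out.term_and_vote.has_value();
}
```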
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fails with an
exception.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The coordinator prefers itself as the "counter leader", so if another
endpoint is chosen as the leader, we know that the coordinator was
not a member of the replica set. We can use this information to
increment the relevant metric (which used to neglect counters
completely).
Fixes #4337
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
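A sketch of the leader choice (assumed types, not the actual storage_proxy code):

```cpp
#include <algorithm>
#include <vector>

using endpoint = int;

// Prefer the coordinator itself as counter leader. Consequently, if
// the chosen leader differs from the coordinator, the coordinator was
// necessarily outside the replica set -- exactly the condition under
// which the writes_coordinator_outside_replica_set metric is bumped.
endpoint pick_counter_leader(endpoint self, const std::vector<endpoint>& replicas) {
    if (std::find(replicas.begin(), replicas.end(), self) != replicas.end()) {
        return self;
    }
    return replicas.front(); // any replica; the real code picks a live one
}

bool coordinator_outside_replica_set(endpoint self, endpoint leader) {
    return leader != self;
}
```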
Both methods apply a list of tombstones to the stream. One
was unused even before this set, the other became unused
after the previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the reader gets all range tombstones from the given
range and places them into a stream. When filling the buffer
with fragments the range tombstones are extracted from the
stream one by one.
This is memory consuming, the reader's memory usage shouldn't
depend on the number of inhabitants in the partition range.
The patch implements the heap-based cursor for range tombstones
almost like it's done for rows.
The heap contains range_tombstone_list::iterator_ranges, the
tombstones are popped from the heap when needed, are applied
into the stream and then are emitted from it into the buffer.
The refresh_state() is called on each new range to set up the
iterators, and when lsa reports reference invalidation, to refresh
them. To let refresh_state revalidate the iterators, the position
at which the last range tombstone was emitted is maintained.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
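The memory argument can be illustrated with a generic sketch (not the actual reader code): a heap of per-source cursors emits elements lazily, so memory is bounded by the number of sources rather than the number of elements.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Lazily merge sorted sources through a min-heap of cursors. Only one
// (value, position) entry per source lives in memory at a time --
// analogous to keeping range_tombstone_list iterator ranges on a heap
// instead of materializing every tombstone of the range up front.
std::vector<int> merge_lazily(const std::vector<std::vector<int>>& sources) {
    using cursor = std::pair<int, std::pair<size_t, size_t>>; // value, (source, index)
    std::priority_queue<cursor, std::vector<cursor>, std::greater<>> heap;
    for (size_t s = 0; s < sources.size(); ++s) {
        if (!sources[s].empty()) {
            heap.push({sources[s][0], {s, 0}});
        }
    }
    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, pos] = heap.top();
        heap.pop();
        out.push_back(value);
        auto [s, i] = pos;
        if (i + 1 < sources[s].size()) {
            // Advance this source's cursor instead of holding all
            // of its elements in memory.
            heap.push({sources[s][i + 1], {s, i + 1}});
        }
    }
    return out;
}
```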
The existing refresh_state() is supposed to set up or revalidate
iterators to rows inside partition versions if needed. It will
be called in more than one place soon, so here's the helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lsa_partition_reader is the helper sub-class for
partition_snapshot_reader that, among other things, is
responsible for filling the stream of range tombstones,
that's then used by the reader itself.
Next patches will change the way range tombstones are
emitted by the reader, so hide the stream inside the
helper subclass in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method "notifies" the lsa_reader helper class when the owning
reader moves to a new range. This method is now empty, but will be
used by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The next patch will extend the comparator to manage a heap of
range tombstones. To avoid adding yet another comparator to
it (and creating another heap comparator class), just
use the comparator that's common to both -- rows and range
tombstones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are already two raii-ish comparators on the reader, and the next
patch would need to add a third. This just bloats the reader; the
comparators in question are stateless and can be created on demand for free.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them now -- one to return iterator_range that
covers the given query::clustering_range, the other to return
it for two given positions.
In the next patch a 3rd one is needed -- a slice() to get an
iterator_range that
a) starts strictly after a given position
b) ends after the given clustering_range's end
It will be used to refresh the range tombstone iterators after
some of them have been emitted. The same thing is currently
done by partition_snapshot_reader's refresh_state wrt rows:
if (last_row)
start = rows.upper_bound(last_row) // continuation
else
start = rows.lower_bound(range.start) // initial
end = rows.upper_bound(range.end) // end is the same in
// either case
Respectively for range tombstones the goal is the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The range_tombstone_list::slice() set of methods returns a
pair of iterators representing a range. In the next patches
this pair will be actively used, and it's handy to have a
shorter alias for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previously, when a linking attempt failed, configure.py immediately
printed that neither lld nor gold was found, which might be misleading
if the linkers are installed but the compilation failed for another reason.
The printed information is now more specific, and combined with the
previous commit, it will also provide more information about why the
compilation attempt failed.
If the configuration step failed, it used to only inform that
it must be due to the wrong GCC version, which can be misleading.
For instance, trying to compile with clang and incorrect flags
also resulted in a "wrong GCC version" message.
Now, the message is more generic, but it also prints the stderr
output from the failed compilation, which may help pinpoint the problem:
$ ./configure.py --mode release --cflags='-fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000' --compiler=clang++ --c-compiler=clang
Note: neither lld nor gold found; using default system linker
Compilation failed: clang++ -x c++ -o build/tmp/tmp1177gojf /home/sarna/repo/scylla/build/tmp/tmp_u3voys6 -fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000 []
// clang pretends to be gcc (defined __GNUC__), so we
// must check it first
#ifdef __clang__
#if __clang_major__ < 10
#error "MAJOR"
#endif
#elif defined(__GNUC__)
#if __GNUC__ < 10
#error "MAJOR"
#elif __GNUC__ == 10
#if __GNUC_MINOR__ < 1
#error "MINOR"
#elif __GNUC_MINOR__ == 1
#if __GNUC_PATCHLEVEL__ < 1
#error "PATCHLEVEL"
#endif
#endif
#endif
#else
#error "Unrecognized compiler"
#endif
int main() { return 0; }
clang-11: error: unknown argument: '-fhello'
distcc[4085341] ERROR: compile (null) on localhost failed
Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.
After previous patches some places in the cql3 code take a
long path to get the database reference:
query processor -> storage proxy -> database
The query processor can provide the database reference
by itself, so take this chance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Thrift needs the migration manager to call announce_<something> on
it, and currently it grabs the global migration manager instance.
Since the thrift handler has a query processor reference onboard and
the query processor can provide the migration manager reference,
it's time to remove a few more globals from thrift code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the migration manager can be obtained from the query
processor, the table helper can also benefit from it and no
longer call for the global migration manager instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are a few schema altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they are all continuations of make_ready_future<>()s, so the
query processor can simply be captured by reference and used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This statement needs the query processor one step down the
stack from its .announce_migration method. So here's a
dedicated patch for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the schema altering statement implementations can now
stop calling for the global migration manager instance and get it
from the query processor.
Here are the trivial cases where the query processor is just
available at the place where it's needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor sits above the migration manager in the
services layering: it's started after and (will be) stopped
before the migration manager.
The migration manager is needed in schema altering statements,
which are called with a query processor argument. They will
later get the migration manager from the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the only call to .announce_migration has the
query processor at hand -- pass it to the real statements.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema altering statements all inherit from the same
base class, which declares a pure virtual .announce_migration()
method. All the real statements are called with a storage proxy
argument, while they need the migration manager. So like in the
previous patch -- replace the storage proxy with the query processor.
While doing the replacement, also get the database instance from
the query processor, not from the proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the statement's execute() method accepts the storage
proxy as the first argument. This is enough for all of them
but the schema altering ones, because the latter need to call
the migration manager's announce.
To provide the migration manager to those who need it, some
higher-level service than the proxy is needed. The query
processor seems to be a good candidate for it.
That said -- all the .execute()s now accept the query
processor instead of the proxy, and get the proxy itself from
the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.
This is currently no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better be insulated against such details.
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.
Fix by splitting the conditional into two.
Move boost tests to tests/raft and factor out common helpers.
* alejo/raft-tests-reorg-5-rebase-next-2:
raft: tests: move common helpers to header
raft: tests: move boost tests to tests/raft
Refs #7794
If we need to pre-fill the segment file in O_DSYNC mode, we should
drop that flag for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.
So the space requirement for cleanup to succeed can be up to the size of the
keyspace, increasing the chances of the node running out of space.
Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.
Also, all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, making them all much slower and, consequently,
the operation time significantly higher.
This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.
Fixes #8247.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command,
which used std::current_exception() - which wasn't relevant at that point -
instead of "ep".
Refs #8089
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
This is how the PhD thesis explains the need for the prevoting stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
The prevoting stage addresses that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state (in the same term).
In our implementation we have a "stable leader" extension that prevents
spurious RequestVote from deposing an active leader, but an AppendEntries
with a higher term will still do that, so the prevoting extension is also
required.
* scylla-dev/raft-prevote-v5:
raft: store leader and candidate state in state variant
raft: add boost tests for prevoting
raft: implement prevoting stage in leader election
raft: reset the leader on entering candidate state
raft: use modern unordered_set::contains instead of find in become_candidate
We already have server-state-dependent state in the fsm, so there is no need
to maintain the "voters" and "tracker" optionals as well. The upside is that
the optional and variant states cannot drift apart now.
This new method is preferred over all() for iteration purposes, because
all() may have to copy sstables into a temporary.
For example, all() implementation of the upcoming compound_sstable_set
will have no choice but to merge all sstables from N managed sets into
a temporary.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>
"
Currently the sstable reader code is scattered across several source
files as following (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
byte stream;
This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.
This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.
This patchset only moves code around, no logical changes are made.
Tests: unit(dev)
"
* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
sstables: get rid of mp_row_consumer.{hh,cc}
sstables: get rid of row.hh
sstables/mp_row_consumer.hh: remove unused struct new_mutation
sstables: move mx specific context and consumer to mx/reader.cc
sstables: move kl specific context and consumer to kl/reader.cc
sstables: mv partition.cc sstable_mutation_reader.hh
Let's make stop_compaction() use sstables::to_compaction_type(),
so all supported compaction types can now be aborted.
Refs #7738.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will make it easier to introduce new type and also to map type to
string and vice-versa, using reverse lookup.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
Move all the mx format specific context and consumer code to
mx/reader.cc and add a factory function `mx::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the mx
specific context and consumer.
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by tests is moved to
kl/reader_impl.hh, while code that can be hidden is moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
The sstable reader currently knows the definition of all the different
consumers and contexts. But it doesn't really need to, as it is a
template. Exploit this and prepare for an organization scheme where the
consumers and contexts live hidden in a cc file which includes and
instantiates the sstable reader template. As a first step expose
`sstable_mutation_reader` in a header.
Consider the following scenario:
- 3 nodes in the cluster with rf = 3
- run repair on node1 with ignore_nodes to ignore node2 and node3
- node1 has no followers to repair with
However, currently node1 will walk through the repair procedure, reading
data from disk and calculating hashes, which is unnecessary.
This patch fixes this issue, so that in case there are no followers, we
skip the range and avoid the unnecessary work.
Before:
$ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"
repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats:
repair_reason=repair, keyspace=myks3, tables={standard1},
ranges_nr=769, sub_ranges_nr=769, round_nr=1456,
round_nr_fast_path_already_synced=1456,
round_nr_fast_path_same_combined_hashes=0,
round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.0.1, 2822972}},
row_from_disk_nr={{127.0.0.1, 6218}},
row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s,
row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
Data was read from disk.
After:
$ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"
repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats:
repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769,
sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={},
row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s,
row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
No data was read from disk.
Fixes #8256
Closes #8257
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause. This will allow us to eventually drop the `restrictions` hierarchy altogether.
Tests: unit (dev, debug)
Closes #8227
* github.com:scylladb/scylla:
cql3: Make get_clustering_bounds() use expressions
cql3/expr: Add is_multi_column()
cql3/expr: Add more operators to needs_filtering
cql3: Replace CK-bound mode with comparison_order
cql3/expr: Make to_range globally visible
cql3: Gather slice-defining WHERE expressions
cql3: Add statement_restrictions::_where
test: Add unit tests for get_clustering_bounds
Not resetting a leader causes vote requests to be ignored instead of
rejected, which makes a voting round take more time to fail and may
slow down new leader election.
Use expressions instead of _clustering_columns_restrictions. This is
a step towards replacing the entire restrictions class hierarchy with
expressions.
Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Omitting these operators didn't cause bugs, because needs_filtering()
is never invoked on them. But that will likely change in the future,
so add them now to prevent problems down the road.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Instead of defining this enum in multi_column_restriction::slice, put
it in the expr namespace and add it to binary_operator. We will need
it when we switch bounds calculation from multi_column_restriction to
expr classes.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
It will be used in statement_restrictions for calculating clustering
bounds. And it will come in handy elsewhere in the future, I'm sure.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add statement_restrictions::_clustering_prefix_restrictions and fill
it with relevant expressions. Explain how to find all such
expressions in the WHERE clause.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Fixes#8212
Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to the check routine which, if non-empty, lists the tables to check.
The use case is "scrub", which calls with a limited set of tables
to snapshot.
Closes #8240
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().
This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.
This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk),
so that in the write() path chunks are normally
allocated as 128KB allocations.
Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.
Refs #7950
Refs #8081
Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
On environments where the hard limit for coredumps is set to zero, the
coredump test script will fail since the system does not generate a coredump.
To avoid this issue, set ulimit -c 0 before generating the SEGV in the script.
Note that scylla-server.service can generate a coredump even with ulimit -c 0,
because we set LimitCORE=infinity in its systemd unit file.
Fixes #8238
Closes #8245
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
sstable_set: move all() implementation into sstable_set_impl
sstable_set: preparatory work to change sstable_set::all() api
sstables: remove bag_sstable_set
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.
Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
The main motivation behind this is that by moving all() impl into
sstable_set_impl, sstable_set no longer needs to maintain a list
with all sstables, which in turn may disagree with the respective
sstable_set_impl.
This will be very important for compound_sstable_set_impl which
will be built from existing sets, and will implement all() by
combining the all() of its managed sets.
Without this patch, we'd have to insert the same sstable at
both compound set and also the set managed by it, to guarantee
all() of compound set would return the correct data, which would
be expensive and error prone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so a user can iterate through the list assuming
that it is alive all the way through.
This will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.
So even range-based loops on all() have to keep a ref to the returned
list, to keep it from being prematurely destroyed.
So the following code
so the following code
for (auto& sst : *sstable_set.all()) { ...}
becomes
for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction when trying to
untangle the maze that is the sstable reader class hierarchy.
So this patchset folds it into its only real user: the sstable reader.
"
* 'data_consume_context_bye' of https://github.com/denesb/scylla:
sstable: move data_consume_* factory methods to row.hh
sstables: fold data_consume_context: into its users
sstables: partition.cc: remove data_consume_* forward declarations
The indentation level is significantly reduced, and so is the number
of allocations. The function signature is changed from taking an rvalue
ref to taking the unique_ptr by value, because otherwise the coroutine
captures the request as a reference, which results in use-after-free.
Tests: unit(dev)
Closes #8249
* github.com:scylladb/scylla:
alternator: drop read_content_and_verify_signature
alternator: coroutinize handle_api_request
The indentation level is significantly reduced, and so is the number
of allocations.
The function signature is changed from taking an rvalue ref to taking
the unique_ptr by value, because otherwise the coroutine captures
the request as a reference, which results in use-after-free.
`data_consume_context` is a thin wrapper over the real context object,
and it does little more than forward method calls to it. The few
methods doing more than mere forwarding can be folded into its single
real user: `sstable_reader`.
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then processed it. We can't avoid reading the entire request up-front -
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This is always a bad idea. We should use a
non-contiguous buffer, and that's the goal of this patch.
We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading the content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.
Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing completed. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.
Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON parsing
code which needs to cross chunk boundaries work as expected.
Fixes #7213.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
The test populates the cache, then invalidates it, then tries to push
huge (10x the segment size) chunks into seastar memory, hoping that
the invalid entries will be evicted. The exit condition of the last
stage is: the total memory of the region (sum of both used and free)
becomes less than the size of one chunk.
However, the condition is wrong, because the cache usually contains a dummy
entry that's not necessarily on the lru, and on some test iteration it may
happen that
evictable size < chunk size < evictable size + dummy size
In this case the test fails with bad_alloc, being unable to evict the memory
from under the dummy.
fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
seed from the issue + 200 times more with random seeds)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
Change test/raft directory to Boost test type.
Run replication_test cases with their own test.
RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops.
The directory test/raft is changed to host Boost tests instead of unit.
While at it, improve the documentation.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In raft the UUID 0 is a special case, so server ids start at 1.
Add two helper functions: one to convert a local 0-based id to a raft
1-based UUID, and one to convert a UUID back to a raft_id.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Change global map of disconnected servers to a more intuitive class
connected. The class is callable for the most common case
connected(id).
Methods connect(), disconnect(), and all() are provided for readability
instead of directly calling map methods (insert, erase, clear). They
also support both numerical (0 based) and server_id (UUID, 1 based) ids.
The actual shared map is kept in a lw_shared_ptr.
The class is passed around to be copy-constructed which is practically
just creating a new lw_shared_ptr.
Internally it tracks disconnected servers but externally it's more
intuitive to use connect instead of disconnect. So it reads
"connected id" and "not disconnected id", without double negatives.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This is a reworked submission of #7686 which has been reverted. This series
fixes some race conditions in MV/SI schema creation and load; we spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause an unrecoverable error, since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur
in 2 main and uncommon situations: in a mixed cluster (during an upgrade) and
in a small window after a view or base table alteration.
Fixes #7709
Closes #8091
* github.com:scylladb/scylla:
database: Fix view schemas in place when loading
global_schema_ptr: add support for view's base table
materialized views: create view schemas with proper base table reference.
materialized views: Extract fix legacy schema into its own logic
In some cases, a user may want to repair the cluster while ignoring a node
that is down. For example, running repair before the removenode operation
that removes a dead node.
Currently, repair will ignore the dead node and keep running without it,
but will report the repair as partial and as failed. It is hard to tell
whether the repair failed only because the dead node is not present or
because of some other errors.
In order to exclude the dead node, one can use the hosts option. But it
is hard to understand and use, because one needs to list all the "good"
hosts including the node itself. It will be much simpler, if one can
just specify the node to exclude explicitly.
In addition, we support ignore nodes option in other node operations
like removenode. This change makes the interface to ignore a node
explicitly more consistent.
Refs: #7806
Closes #8233
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
This patch adds support for non-voting members. A non-voting member is a
member whose vote is not counted for leader election purposes and commit
index calculation purposes, and which cannot become a leader. But otherwise
it is a normal raft node. The state is needed to let new nodes catch
up their log without disturbing the cluster.
All kinds of transitions are allowed. A node may be added as a voting member
directly, or it may be added as non-voting and then changed to a voting
one through an additional configuration change. A node can be demoted from
a voting to a non-voting member through a configuration change as well.
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
bag_sstable_set can be replaced with partitioned_sstable_set, which
will provide the same functionality, given that L0 sstables go to
a "bag" rather than interval map. STCS, for example, will only
have L0 sstables, so it will get exactly the same behavior with
partitioned_sstable_set.
It also gives us the benefit of keeping the leveled sstables in
the interval map if user has switched from LCS to STCS, until
they're all compacted into size-tiered ssts.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since Scylla requires C++20, there is no need to protect
concept definitions or usages with SEASTAR_CONCEPT; it just
clutters the code. This patch therefore removes all uses.
Closes #8236
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes times
out, gossiper::do_shadow_round will throw an exception and fail the
whole boot-up process.
It is fine for some of the remote nodes to time out in the shadow round; it is
not a must to talk to all nodes.
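The tolerant behavior can be sketched as follows (hypothetical helper names, not the gossiper's real API):

```python
def shadow_round(nodes, get_endpoint_states):
    # Collect endpoint states from whichever nodes answer; a timeout
    # from one node must not fail the whole boot-up.
    states, failures = {}, []
    for node in nodes:
        try:
            states[node] = get_endpoint_states(node)
        except TimeoutError as e:
            failures.append((node, e))  # log and carry on
    if not states and failures:
        # Every contacted node timed out: that is still a fatal error.
        raise failures[0][1]
    return states
```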
This patch fixes an issue we saw recently in our sct tests:
```
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...
ERR | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```
Fixes #8187
Closes #8213
We already have a test which verifies that DynamoDB and Alternator
do not allow an index in an attribute path - like a[0].b - to be
a value reference - a[:xyz].b. We forgot to verify that the index
also can't be a name reference - a[#xyz].b is a syntax error. So here
we add a test which confirms that this is indeed the case - DynamoDB
doesn't allow it, and neither does Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>
On the Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find a free mdX (587b909).
However, looking into /proc/mdstat of the AMI, it actually says no active md device is available:
ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>
We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to the kernel documentation, the file may be available even when the array is STOPPED:
clear
No devices, no size, no level
Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html
So we should also check array_state != 'clear', not just the file's
existence.
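The fixed check amounts to this sketch (the path is parameterized here for illustration; the real script hard-codes /sys/block):

```python
import os

def md_device_in_use(state_path):
    # The array_state file can exist even while the array is stopped;
    # the 'clear' state means no devices, no size, no level - i.e. free.
    if not os.path.exists(state_path):
        return False
    with open(state_path) as f:
        return f.read().strip() != 'clear'
```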
Fixes #8219
Closes #8220
has_monotonic_positions() wants to check for a greater-than-or-equal-to
relation, but actually tests for not-equal, since it treats a
trichotomic comparator as a less-than comparator. This is clearly seen
in the BOOST_FAIL message just below.
Fix by aligning the test with the intended invariant. Luckily, the tests
still pass.
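The bug pattern is easy to reproduce in a sketch (Python here, hypothetical names): a trichotomic comparator used as a boolean predicate is truthy whenever the operands differ.

```python
def tri_compare(a, b):
    # Trichotomic comparator: negative, zero, or positive.
    return (a > b) - (a < b)

def less_misused(a, b):
    # Wrong: truthy for a > b as well, so it really tests "a != b".
    return bool(tri_compare(a, b))

def less_correct(a, b):
    return tri_compare(a, b) < 0
```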
Ref #1449.
Closes #8222
On restart the view schemas are loaded and might contain old
views with an unmarked computed column. We already have code to
update the schema, but before we do it we load the view as is. This
is not desired since, once registered, this view version can be used
for writes, which is forbidden: we would spot a non-computed
column which is in the view's primary key but not in the base table
at all. To solve this, in addition to altering the persistent schema,
we fix the view's loaded schema in place. This is safe since the computed
column is only involved in generating a value for this column when
creating a view update, so the effect of this manipulation stays
internal.
The second stage of the in-place fixing is to persist the changes,
in particular to the `computed_columns` table, so the view is ready
for the next node restart.
Up until now, the global_schema_ptr object was a crack
through which a view schema with an uninitialized base
reference could sneak. Even if the schema itself contained a
base reference, the base schema didn't carry over to shards
different from the shard on which the global_schema_ptr was
created.
Since once the schema is in the registry it might be used for
everything (reads and writes), we also need to make sure that
global schemas for incomplete view schemas will not be created.
Newly created view schemas don't always have their base info;
this is bad since such schemas support neither reads nor writes.
This leaves us vulnerable to a race condition where there is
an attempt to use this schema for read or write. Here we initialize
the base reference and also reconfigure the view to conform to the
new computed column type, which makes it usable for writes and not only
reads. We do it for views created in the migration manager following
announcements and also for copied schemas.
We extract the logic for fixing the view schema into its own
function, as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant, since
it becomes a two-line wrapper for this logic. We therefore
remove it here and replace the call to it with the equivalent code.
The current fs.aio-max-nr value, cpu_count() * 11026, is exactly the amount
Scylla uses; if other apps in the environment also try to use aio, the
aio slots will run out.
So increase the value by 65536 to leave room for other apps.
Related #8133
Closes #8228
Fixes an sstableloader bug where we quoted twice column names that
had to be quoted, and therefore failed on such tables - and in particular
Alternator tables which always have a column called ":attrs".
Fixes #8229
* tools/java 142f517a23...c5d9e8513e (1):
> sstableloader: Only escape column names once
034cb81323 and 0f0c3be disallowed reverse partition-range scans based on
the observation that the CQL frontend disallows them, assuming that
other client APIs also disallow them. As it turns out this is not true,
and there is at least one client API (Thrift) which does allow reverse
range scans. So re-enable them.
Fixes: #8211
Tests: unit(release), dtest(thrift_tests.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>
Before this patch, if Scylla crashes during some test in cql-pytest, all
tests after it will fail because they can't connect to Scylla - and we can
get a report on hundreds of failures without a clear sign of where the real
problem was.
This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing CQL command after each
test. If this CQL command fails, we conclude that Scylla crashed and
report the test in which this happened - and exit pytest instead of failing
a hundred more tests.
Fixes #8080
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210304132804.1527977-1-nyh@scylladb.com>
In a scenario where a node is running out of disk space, which is a common
cause of cluster expansion, it's very important to clean up the smallest
files first to increase the chances of success when the biggest files are
reached down the road. That's possible given that cleanup operates on a
single file at a time, and that the smaller the file the smaller the
space requirement.
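The ordering heuristic amounts to this sketch (Python stand-in; the real cleanup code is C++):

```python
import os

def cleanup_order(sstable_paths):
    # Process the smallest files first: each cleaned file frees space,
    # improving the odds that the largest files can be rewritten later.
    return sorted(sstable_paths, key=os.path.getsize)
```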
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>
This function currently eagerly decrements `_size`, before `func()` is
invoked. If `func()` throws the consumption fails but the size remains
decremented. If this happens right at the last element in the row, the
`row::empty()` will incorrectly return `true`, even though there is
still one cell left in it. Move the decrement after the `func()`
invocation to avoid this by only decrementing if the consumption
was successful.
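The exception-safety issue can be illustrated with a sketch (Python stand-in for the C++ `row` type; names are assumptions):

```python
class Row:
    def __init__(self, cells):
        self._cells = list(cells)
        self._size = len(self._cells)

    def consume_last(self, func):
        cell = self._cells[self._size - 1]
        func(cell)        # may throw: size must not be decremented yet
        self._size -= 1   # fix: decrement only after func() succeeded

    def empty(self):
        return self._size == 0
```

With the decrement before `func()`, a throwing `func` on the last cell would leave `empty()` returning true while a cell is still present.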
Fixes: #8154
Tests: unit(mutation_test:release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304125318.143323-1-bdenes@scylladb.com>
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.
Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hint flushing to
completely deadlock, with no more hints admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.
To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.
Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again
Fixes #8137
Closes #8206
* github.com:scylladb/scylla:
hints_manager: don't use commitlog hard space limit
commitlog: add an option to allow going over size limit
tri_compare() returns an int, which is dangerous as a tri_compare can
be misused where a less_compare is expected. To prevent such misuse,
convert the interval<> template to accept comparators that return
std::strong_ordering, and then convert dht::token's comparator to do
the same.
Ref #1449.
Closes #8181
* github.com:scylladb/scylla:
dht: convert token tri_compare to std::strong_ordering
interval: support C++20 three-way comparisons
This pull request removes seastar namespace imports from the header
files. There are some additional cleanups to make that easier and to
remove some commented out code.
Closes #8202
* github.com:scylladb/scylla:
redis: Remove seastar namespace import from query_processor.hh
redis: Switch to seastar::sharded<> in query_procesor.hh
redis: Remove seastar namespace import from query_utils.hh
redis: Remove seastar namespace import from reply.hh
redis: Remove commented out code from options.hh
redis: Remove seastar namespace import from options.hh
redis: Remove seastar namespace import from service.hh
redis: Switch to seastar::sharded<> in service.{hh,cc}
redis: Remove unneeded include from keyspace_utils.hh
redis: Remove seastar namespace import from keyspace_utils.hh
redis: Remove seastar namespace import from command_factory.hh
redis: Fix include path in command_factory.hh
redis: Remove unneeded includes from command_factory.hh
Refs: #8012
Fixes: #8210
With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-PR change to make the clustering of CDC generation timestamps reversed. So our Alternator
shard info became quite rump-stumped, leading to more or less no data, depending on when generations
changed w.r.t. data.
Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.
Closes #8209
* github.com:scylladb/scylla:
alternator::streams: Use better method for generation timestamp
system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
With the new scheme for cdc generation management, one of the last
changes was to make the time ordering of the stream timestamps reversed.
However, cdc_get_versioned_streams forgot to take this into account
when sifting out timestamp ranges for stream retrieval (based on
low mark).
Fixed by doing reverse iteration.
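A sketch of the fix (hypothetical data layout; timestamps are stored newest-first per the new clustering order):

```python
def versioned_streams_since(timestamps_desc, low_mark):
    # The clustering order is reversed (newest first), so producing the
    # ascending range above the low mark requires reverse iteration.
    return [t for t in reversed(timestamps_desc) if t >= low_mark]
```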
Test log consistency after apply_snapshot() is called.
Ensure log::last_term() log::last_conf_index() and log::size()
work as expected.
Misc cleanups.
* scylla-dev.git/raft-confchange-test-v4:
raft: fix spelling
raft: add a unit test for voting
raft: do not account for the same vote twice
raft: remove fsm::set_configuration()
raft: consistently use configuration from the log
raft: add ostream serialization for enum vote_result
raft: advance commit index right after leaving joint configuration
raft: add tracker test
raft: tidy up follower_progress API
raft: update raft::log::apply_snapshot() assert
raft: add a unit test for raft::log
raft: rename log::non_snapshoted_length() to log::in_memory_size()
raft: inline raft::log::truncate_tail()
raft: ignore AppendEntries RPC with a very old term
raft: remove log::start_idx()
raft: return a correct last term on an empty log
raft: do not use raft::log::start_idx() outside raft::log()
raft: rename progress.hh to tracker.hh
raft: extend single_node_is_quiet test
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.
The merged series was not the latest version. Revert prior to
merging the latest one.
Refs #7961
Fixes #8014
The "untyped_result_set" object was created for small, internal access to cql-stored metadata.
It is nowadays used for rather more than that (cdc).
This has the potential of mixing badly with the fact that the type does deep copying of data
and linearizes everything (not to mention it handles multiple rows rather inefficiently).
Instead of doing a deep copy of the input, we assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is,
avoiding premature linearization.
Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.
Otoh, it allows writing code using this that avoids data copying,
depending on destination.
v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase
v3:
* Changed shared_ptr to unique
Closes #8015
* github.com:scylladb/scylla:
untyped_result_set: Do not copy data from input store (retain fragmented views)
result_generator: make visitor callback args explicit optionals
listlike_partial_deserializing_iterator: expose templated collection routines
Previously, we had two tests demonstrating issue #7966. But since then,
our understanding of this issue has improved which resulted in issue #8203,
so this patch improves those tests and makes them reproduce the new issue.
Importantly, we now know that this problem is not specific to a full-table
scan, and also happens in a single-partition scan, so we fix the test to
demonstrate this (instead of the old test, which missed the problem so
the test passed).
Both tests pass on Cassandra, and fail on Scylla.
Refs #8203.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302224020.1498868-1-nyh@scylladb.com>
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.
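The two parts of the fix can be sketched together (hypothetical framing types, not the CQL transport code):

```python
import io

class Header:
    # Hypothetical parsed frame header: stream id + declared body length.
    def __init__(self, stream_id, body_length):
        self.stream_id = stream_id
        self.body_length = body_length

def shed_request(stream, header):
    # Drain the rejected request's body so the next read starts at a
    # fresh frame header, and reply on the request's own stream id
    # (not stream 0, which confuses the client).
    stream.read(header.body_length)
    return ("ERROR", header.stream_id)
```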
Fixes #8193
Closes #8194
* github.com:scylladb/scylla:
cql-pytest: add a shedding test
transport: return error on correct stream during size shedding
transport: return error on correct stream during shedding
transport: skip the whole request if it is too large
transport: skip the whole request during shedding
This scylla-only test case tries to push a too-large request
to Scylla, and then retries with a smaller request, expecting
a success this time.
Refs #8193
Accidentally introduced in 9eed26ca3d, it can never be true due to
code above it.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8201
Like DynamoDB, Alternator rejects requests larger than some fixed maximum
size (16MB). We had a test for this feature - test_too_large_request,
but it was too blunt, and missed two issues:
Refs #8195
Refs #8196
So this patch adds two better tests that reproduce these two issues:
First, test_too_large_request_chunked verifies that an oversized request
is detected even if the body is sent with chunked encoding.
Second, both tests - test_too_large_request_chunked and
test_too_large_request_content_length - verify that the rather limited
(and arguably buggy) Python HTTP client is able to read the 413 status
code - and doesn't report some generic I/O error.
Both tests pass on DynamoDB, but fail on Alternator because of these two
open issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302154555.1488812-1-nyh@scylladb.com>
The main goal of this patch is to add a reproducer for issue #7966, where
partition-range scan with filtering that begins with a long string of
non-matches aborts the query prematurely - but the same thing is fine with
a single-partition scan. The test, test_filtering_with_few_matches, is
marked as "xfail" because it still fails on Scylla. It passes on Cassandra.
I put a lot of effort into making this reproducer *fast* - the dev-build
test takes 0.4 seconds on my laptop. Earlier reproducers for the same
problem took as much as 30 seconds, but 0.4 seconds turns this test into
a viable regression test.
We also add a test, test_filter_on_unset, which reproduces issue #6295 (or
its duplicate #8122); that issue was already solved, so this test passes.
Refs #6295
Refs #7966
Refs #8122
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210301170451.1470824-1-nyh@scylladb.com>
Refs #8093
Refs /scylladb/scylla-tools-java#218
Adds a keyword that can preface value tuples in (a, b, c) > (1, 2, 3)
expressions, forcing the restriction to bypass column sort order
treatment, and instead just create the raw ck bounds accordingly.
This is a very limited and simple version, but since we only need
to cover the above exact syntax, this should be sufficient.
v2:
* Add small cql test
v3:
* Added comment in multi_column_restriction::slice, on what "mode" means and is for
* Added small document of our internal CQL extension keywords, including this.
v4:
* Added a few more cases to tests to verify multi-column restrictions
* Reworded docs a bit
v5:
* Fixed copy-paste error in comment
v6:
* Added negative (error) test cases
v7:
* Added check + reject of trying to combine SCYLLA_CLUST... slice and
normal one
Closes #8094
The Redis transport layer seems to have originated as a copy-paste of
the CQL transport layer. This pull request removes bunch of unused
and commented out bits of code, and also does some minor cleanups
like organizing includes, to make the code more readable.
Closes #8198
* github.com:scylladb/scylla:
redis: Remove unused to_bytes_view() function from server.cc
redis: Remove unused tracing_request_type enum
redis: Remove unneeded connection friend declaration
redis: Remove unused process_request_executor friend declaration
redis: Remove unused _request_cpu class member
redis: Remove commented out code from server.hh
redis: Remove duplicate request.hh include
redis: Remove unused db::config forward declaration
redis: Remove unused fmt_visitor forward declaration
redis: Organize includes in server.{cc,hh}
redis: Switch to seastar::sharded<>
redis: Remove redundant access modifiers from server.hh
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because the hints
manager uses small-sized commitlogs to store them (128MB, currently). Instead of
letting the commitlog delete the segments itself, it recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.
If the 128MB limit is reached, the hints manager will get stuck. A
future which puts a hint into the commitlog holds a shared lock, and commitlog
recreation needs to get an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.
By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
When a request is shed due to being too large, its response
was sent with stream id 0 instead of the stream id that matches
the communication lane. That in turn confused the client,
which is no longer the case.
When a request is shed due to exceeding the max number of concurrent
requests, its response was sent with stream id 0 instead of
the stream id that matches the communication lane.
That in turn confused the client, which is no longer the case.
On the RAID prompt, we can type a disk list something like this:
/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1
However, if the list has spaces in it, it doesn't work:
/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1
This is because the script mistakenly recognizes the space as part of a device path.
So we need to strip() each item of the input.
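The fix amounts to this sketch (hypothetical helper name, not the script's actual function):

```python
def parse_disk_list(line):
    # Split on commas and strip surrounding whitespace from each item,
    # so '/dev/sda1, /dev/sdb1' parses the same as '/dev/sda1,/dev/sdb1'.
    return [item.strip() for item in line.split(',') if item.strip()]
```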
Fixes #8174
Closes #8190
When a request is shed due to being too large, only the header
was actually read, and the body was still stuck in the socket
- and would be read in the next iteration, which would expect
to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.
Fixes #8193
When a request is shed due to exceeding the number of max concurrent
requests, only its header was actually read, and the body was still
stuck in the socket - and would be read in the next iteration,
which would expect to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.
Refs #8193
"
Currently range scans build their results on the replica in the
`reconcilable_result` format, that -- as its name suggests -- is
normally used for reconciliation (read repair). As such this result
format is quite inefficient for normal queries: it contains all columns
and all tombstones in the requested range. These are all unnecessary for
normal queries which only want live data and only those columns that are
requested by the user.
Furthermore, as the coordinator works in terms of `query::result` for
normal queries anyway, this intermediate result has to be converted to
the final `query::result` format adding an unnecessary intermediate
conversion step.
This series gets rid of this problem by introducing
`query_data_on_all_shards()`, a variant of
`query_mutations_on_all_shards()` that builds `query::result` directly.
Reverse queries still use the old intermediate method behind the scenes.
Fixes #8061
Refs #7434
Tests: unit(release, debug)
"
* 'range-scan-data-variant/v5-rebased' of https://github.com/denesb/scylla:
cql_query_test: add unit test for the more efficient range scan result format
test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
cql_query_test: test_query_limit: clean up scheduling groups
storage_proxy: use query_data_on_all_shards() for data range scan queries
query: partition_slice: add range_scan_data_variant option
gms: add RANGE_SCAN_DATA_VARIANT cluster feature
multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
multishard_mutation_query: add query_data_on_all_shards()
mutation_partition.cc: fix indentation
query_result_builder: make it a public type
multishard_mutation_query: generalize query code w.r.t. the result builder used
multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
multishar_mutation_query: do_query_mutations(): convert to coroutine
multishard_mutation_query: read_page(): convert to coroutine
multishard_mutation_query: extract page reading logic into separate method
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries work as the user expects
them to work: unselected columns are not included in determining the
size of the result (or that of the page). This is the aspect this test
is checking for. While at it, we also test single-partition queries.
Currently range scans build their result using the `reconcilable_result`
format and then convert it to `query::result`. This is inefficient for
multiple reasons:
1) it introduces an additional intermediate result format and a
subsequent conversion to the final one;
2) the reconcilable result format was designed for reconciliation so it
contains all data, including columns unselected by the query, dead
rows and tombstones, which takes much more memory to build;
There is no reason to go through all this trouble; if there ever was one
in the past, it doesn't stand anymore. So switch to the newly introduced
`query_data_on_all_shards()` when doing normal data range scans, but
only if all the nodes in the cluster support it, to avoid artificial
differences in page sizes due to how reconcilable_result and
query::result calculate result size, and the consequent false-positive
read repair.
The transition to this new more efficient method is coordinated by a
cluster feature and whether to use it is decided by the coordinator
(instead of each replica individually). This is to avoid needless
reconciliation due to the different page sizes the two formats will
produce.
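The coordinator-side decision is essentially the following sketch (the feature name comes from the patch list above; everything else is hypothetical):

```python
def choose_result_format(cluster_features):
    # Decided once, by the coordinator, so every replica in a query
    # builds pages the same way and page sizes stay comparable.
    if "RANGE_SCAN_DATA_VARIANT" in cluster_features:
        return "data"      # build query::result directly
    return "mutation"      # legacy reconcilable_result path
```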
Switching to the data variant of range scans has to be coordinated by
the coordinator, to avoid replicas noticing the availability of the
respective feature at different times, resulting in some using the
mutation variant and some using the data variant.
So the plan is that it will be the coordinator's job to check the
cluster feature and set the option in the partition slice which will
tell the replicas to use the data variant for the query.
To control the transition to the data variant of range scans. As there
is a difference in how the data and mutation variants calculate page
sizes, the transition to the former has to happen in a controlled
manner, when all nodes in the cluster support it, to avoid artificial
differences in page content and subsequently triggering false-positive
read repair.
Refuse reverse queries just like in the new
`query_data_on_all_shards()`. The reason is the same, reverse range
scans are not supported on the client API level and hence they are
underspecified and more importantly: not tested.
A data query variant of the existing `query_mutations_on_all_shards()`.
This variant builds a `query::result`, instead of `reconcilable_result`.
This is actually the result format coordinators want when executing
range scans, the reason for using the reconcilable result for these
queries is historic, and it just introduces an unnecessary intermediate
format.
This new method allows the storage proxy to skip this intermediate
format and the associated conversion to `query::result`, just like we do
for single partition queries.
Reverse queries are refused because they are not supported on the client
API (CQL) level anyway and hence it is unspecified how they should work
and more importantly: they are not tested.
We want to add support to building `query::result` directly and reuse
the code path we use to build reconcilable result currently for it.
So templatize said code path on the result builder used. Since the
different result builders don't have a source level compatible interface
an adaptor class is used.
In the next patches we are going to generalize the query logic w.r.t.
the result builder used, so query_mutations_on_all_shards() will be just
a facade parametrizing the actual query code with the right result
builder.
So it can be modified while being walked to dispatch
subscribed event notifications.
In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.
Using an atomic vector instead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while it is walked using atomic_vector::for_each.
Fixes #8143
Test: unit(release)
DTest:
update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifecycle_subscribers
to atomic_vector and futurizing unregister_subscriber.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
* seastar 803e790598...ea5e529f30 (3):
> Merge "Teach io_tester to generate YAML output" from Pavel E
> bitset: set_range: mark constructor constexpr
> Update dpdk submodule
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests.
The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).
Closes #8140
* github.com:scylladb/scylla:
treewide: remove timeout config from query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
service: add timeout config to client state
Due to a regression introduced by 463d0ab, regular compaction can compact in parallel an sstable
being compacted by cleanup, scrub or upgrade.
This redundancy causes resources to be wasted; write amplification is increased,
and so is the operation time, etc.
It is a potential source of data resurrection, because the no-longer-owned data from
an sstable being compacted by both cleanup and regular compaction will still exist in the
node afterwards, so resurrection can happen if the node regains ownership.
Fixes #8155.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
This will prevent accumulation of unnecessary dummy entries.
A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.
Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.
In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry, since
it will not be needed: it will fall inside a broader continuous
range. This will be the case for time series workloads which scan with
a decreasing (newest) lower bound.
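The idea can be sketched over a sorted list of dummy positions (a hypothetical stand-in; the cache actually uses an intrusive tree of entries):

```python
import bisect

def populate_range(dummies, lo, hi):
    # Populating [lo, hi] makes that range continuous, so dummy entries
    # strictly inside it become redundant and are erased; only boundary
    # markers at lo and hi are kept.
    i = bisect.bisect_right(dummies, lo)
    j = bisect.bisect_left(dummies, hi)
    del dummies[i:j]
    for b in (lo, hi):
        k = bisect.bisect_left(dummies, b)
        if k == len(dummies) or dummies[k] != b:
            dummies.insert(k, b)
    return dummies
```

Repeated overlapping scans then keep the dummy count proportional to the number of range boundaries, not the number of scans.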
Refs #8153.
_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If an exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.
The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.
perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:
$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]
Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.
Unlike lsa/compact, it doesn't cause reactor stalls.
The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so does more
work. Some use cases don't need to compact LSA segments, just want the
row cache to be wiped.
Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.
This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS, which has 96 vCPUs.
We need to tune the parameter based on the number of CPUs, instead of
using a static setting.
Fixes #8133
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #8188
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the set's contents give the strong exception
guarantee by first inserting new sstables into its containers, and erasing
them if an exception is caught.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.
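The version-merge bookkeeping described above can be sketched as a minimal model (names like `version` and `merge_and_release` are invented for illustration, not the actual sstables code):

```cpp
#include <cassert>
#include <set>
#include <string>

// Minimal model: each version records the sstables it added and erased
// relative to the previous version. When a version loses its last
// reference it is merged into the next one; an sstable that was added
// by the dying version and erased by the next is referenced by no live
// version anymore and can be released immediately.
struct version {
    std::set<std::string> added;
    std::set<std::string> erased;
};

std::set<std::string> merge_and_release(version older, version& newer) {
    std::set<std::string> released;
    for (const auto& s : older.added) {
        if (newer.erased.erase(s)) {
            released.insert(s); // added then erased: no one references it
        } else {
            newer.added.insert(s);
        }
    }
    newer.erased.insert(older.erased.begin(), older.erased.end());
    return released;
}
```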
Fixes #2622
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #8111
* github.com:scylladb/scylla:
sstables: add test for checking the latency of updating the sstable_set in a table
sstables: move column_family_test class from test/boost to test/lib
sstables: use fast copying of the sstable_set instead of rebuilding it
sstables: replace the sstable_set with a versioned structure
sstables: remove potential ub
sstables: make sstable_set constructor less error-prone
"
We have recently seen out-of-order partitions getting into sstables
causing major disruption later on. Given the damage caused, it was again
raised that we should enable partition key monotonicity validation
unconditionally in the sstable write path. This was also raised in the
past but dismissed as key validation was suspected (but not measured) to
add considerable per-fragment overhead. One of the problems was that the
key monotonicity validation was all or nothing. It either validated all
(clustering and partition) key monotonicity or none of it.
This series takes a second look at this and solves the all-or-nothing
problem by making the configuration of the key monotonicity check more
fine grained, allowing for enabling just token monotonicity validation
separately, then enables it unconditionally.
Refs: #7623
Tests: unit(release)
"
* 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla:
sstables: enable token monotonicity validation by default
mutation_fragment_stream_validator: add token validation level
mutation_fragment_stream_validating_filter: make validation levels more fine-grained
This patch adds the compaction id to the get_compaction structure.
While it was supported, it was not used and up until now wasn't needed.
After this patch a call to curl -X GET 'http://localhost:10000/compaction_manager/compactions'
will include the compaction id.
Relates to #7927
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #8186
The error now contains information about the view table that failed,
as well as base and view tokens.
Example:
view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index,
base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error)
Fixes #8177
Closes #8178
Partition key order validation in data written to sstables can be very
disruptive. All our components in the storage layers assume that
partitions are in order, which means that reading out-of-order
partitions triggers undefined behaviour. Computer scientists often joke
that undefined behaviour can erase your hard drive and in this case the
damage done by undefined behaviour caused by out-of-order partitions is
very close to that. The corruption is known to propagate, causing crashes,
corrupting more data and even losing data. For this reason it is
imperative that out-of-order partitions cannot get into sstables. This
patch enables token monotonicity validation unconditionally in
the sstable writer. As partition key monotonicity checks involve a key
copy per partition, which might have an impact on the performance, we do
the next best thing instead and enable only token monotonicity
validation.
In some cases the full-blown partition key validation and especially the
associated key copy per partition might be deemed too costly. As a next
best thing this patch adds a token only validation, which should cover
99% (number pulled out of my sleeve) of the cases. Let's hope no one
gets unlucky.
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.
As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
Change token's tri_compare functions to return std::strong_ordering,
which is not convertible to bool and therefore not susceptible to
being misused where a less-compare is expected.
Two of the users (ring_position and decorated_key) have to undo
the conversion, since they still return int. A follow up will
convert them too.
Ref #1449.
Allow the tri-comparator input to range functions to return
std::strong_ordering, e.g. the result of operator<=>. An int
input is still allowed, and coerced to std::strong_ordering by
tri-comparing it against zero. Once all users are converted, this
will be disallowed.
The clever code that performs boundary comparisons unfortunately
has to be dumbed down to conditionals. A helper
require_ordering_and_on_equal_return() is introduced that accepts
a comparison result between bound values, an expected comparison
result, and what to return if the bound value matches (this depends
on whether individual bounds are exclusive or inclusive, on
whether the bounds are start bounds or end bounds, and on the
sense of the comparison).
Unfortunately, the code is somewhat pessimized, and there is no
way to avoid the pessimization, as the enum underlying
std::strong_ordering is hidden.
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.
Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.
Fix that by updating _lower_bound on dummy entries as well.
Refs #8153.
Tested with perf_row_cache_reads:
```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```
Notice that max preemption latency is low in the second "read:" line.
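A minimal model of the loop shape (invented names, not the actual cache code): preemption only takes effect once the lower bound has moved, so updating the bound on dummy entries too lets the scan yield promptly.

```cpp
#include <cassert>
#include <vector>

struct entry {
    bool dummy;
    int position;
};

// Count how many entries are consumed before the scan yields, assuming
// preemption is signalled from the very first iteration. With
// update_bound_on_dummy == false (the old behaviour) the loop cannot
// stop until it reaches a non-dummy row.
int entries_consumed_before_yield(const std::vector<entry>& entries,
                                  bool update_bound_on_dummy) {
    bool lower_bound_changed = false;
    int consumed = 0;
    for (const auto& e : entries) {
        if (!e.dummy || update_bound_on_dummy) {
            lower_bound_changed = true; // _lower_bound := e.position
        }
        ++consumed;
        if (lower_bound_changed) { // preemption pending; safe to yield now
            break;
        }
    }
    return consumed;
}
```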
Closes #8167
* github.com:scylladb/scylla:
row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
tests: perf: Introduce perf_row_cache_reads
row_cache: Add metric for dummy row hits
The optimal path of said method mistakenly captures `pos` (a local
variable) in its reader factory method and passes a temporary range
implicitly constructed from said `pos` as the range parameter to the
sstable reader. This will lead to the sstable reader using a dangling
range and will result in returning no results for queries. This patch
fixes this bug and adds a unit test to cover this code path.
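The shape of the bug, as a generic illustration (not the actual reader code): a factory closure captures a local by reference and is invoked after the local has gone out of scope. Capturing by value makes the closure own the object:

```cpp
#include <cassert>
#include <functional>
#include <string>

// Buggy shape (left as a comment, since invoking it is undefined
// behaviour): the closure references `pos`, which dies with the
// enclosing scope:
//   return [&pos] { return make_range(pos); };
//
// Fixed shape: the closure owns its own copy of the position, so the
// range it builds can never dangle.
std::function<std::string()> make_range_factory(std::string pos) {
    return [pos = std::move(pos)] { return "range-from-" + pos; };
}
```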
Fixes#8138.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them; instead it silently terminated them, causing
them to return less data than requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
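The control flow being fixed, in miniature (invented names; the real accounter tracks memory across shards): both stop conditions must abort an unpaged query instead of silently stopping it.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

enum class stop_iteration { no, yes };

// Miniature check_and_update(): `short_read_allowed` is false for
// unpaged queries, which must be aborted rather than truncated when
// either the shard-local or the global limit is exceeded.
stop_iteration check_and_update(size_t& local_used, size_t added,
                                size_t local_limit, size_t global_used,
                                size_t global_limit, bool short_read_allowed) {
    local_used += added;
    const bool over = local_used > local_limit || global_used > global_limit;
    if (over && !short_read_allowed) {
        throw std::runtime_error("result size limit exceeded");
    }
    return over ? stop_iteration::yes : stop_iteration::no;
}
```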
Fixes: #8162
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
The multishard combining reader currently assumes that all shards have
data for the read range. This however is not always true and in extreme
cases (like reading a single token) it can lead to huge read
amplification. Avoid this by not pushing shards to
`_shard_selection_min_heap` if the first token they are expected to
produce falls outside of the read range. Also change the read ahead
algorithm to select the shards from `_shard_selection_min_heap`, instead
of walking them in shard order. This was wrong in two ways:
* Shards may be ordered differently with respect to the first partition
they will produce; reading ahead on the next shard in shard order
might not bring in data on the next shard the read will continue on.
Shard order is only correct when starting a new range and shards are
iterated over in the order they own tokens according to the sharding
algorithm.
* Shards that may not have data relevant to the read range are also
considered for read ahead.
After this patch, the multishard reader will only read from shards that
have data relevant to the read range, both in the case of normal reads
and also for read-ahead.
Fixes: #8161
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
This PR introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
We plug the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service call
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
Some parts of generation management still remain in storage_service:
the bootstrap procedure, which happens inside storage_service,
must also do some initialization regarding CDC generations,
for example: on restart it must retrieve the latest known generation
timestamp from disk; on bootstrap it must create a new generation
and announce it to other nodes. The order of these operations w.r.t
the rest of the startup procedure is important, hence the startup
procedure is the only right place for them. We may try decoupling
these services even more in follow-up PRs, but that requires a bit
of careful reasoning. What this PR does is a low-hanging fruit.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these handling functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
This PR is a prerequisite to fixing #7985. The fact that all the CDC generation
management code was in storage_service is technical debt. It will be easier
to modify the management algorithms when they sit in their own module.
Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java
Closes #8172
* github.com:scylladb/scylla:
cdc: move (most of) CDC generation management code to the new service
cdc: coroutinize make_new_cdc_generation
cdc: coroutinize update_streams_description
cdc: introduce cdc::generation_service
main: move cdc_service initialization just prior to storage_service initialization
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.
However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signalled, so that the reader makes forward
progress.
Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.
Fix that by updating _lower_bound on dummy entries as well.
Refs #8153.
Tested with perf_row_cache_reads:
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
Notice that max preemption latency is low in the second "read:" line.
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
dh_installinit --name <service> is for forcing the install of debian/*.service
and debian/*.default files that do not match the package name.
And if we have subpackages, the packager has the responsibility to rename
debian/*.service to debian/<subpackage>.*service.
However, we are currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporter.service, so
the packaging system tries to find the destination package for the .service,
does not find a subpackage name on it, and picks the first
subpackage ordered by name, scylla-conf.
To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.
Fixes #8163
Closes #8164
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow the scylla user to read them.
Note that perftune.yaml does not really need to be 0644 since perftune.py
runs as root, but we set it anyway to align its permissions with the
other files.
Fixes #8049
Closes #8119
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, the size of multi_window was incorrectly used,
and since multi_window is always empty in this path, candidates was
always cleared.
Fixes #8147.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and reclaim work can be done from a preemptible
context.
If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.
A unit test is added to verify it works.
Fixes #1634.
Closes #8044
* github.com:scylladb/scylla:
test: logalloc_test: test background reclaim
logalloc: reduce gap between std min_free and logalloc min_free
logalloc: background reclaim
logalloc: preemptible reclaim
Test that the background reclaimer is able to compete with a
fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU"
is fully randomized.
If the background reclaimer is disabled, the test fails as soon as the
20MB "gap" is exhausted. With the reclaimer enabled, it is able to
free memory ahead of the allocations.
This patch adds to Alternator support for the CORS (Cross-Origin Resource
Sharing) protocol - a simple extension over the HTTP protocol which
browsers use when Javascript code contacts HTTP-based servers.
Although we usually think of Alternator as being used in a three-tier
application, in some setups there is no middle layer and the user's
browser, running Javascript code, wants to communicate directly with the
database. However, for security reasons, by default Javascript loaded
from domain X is not allowed to communicate with different domains Y.
The CORS protocol is meant to allow this, and Alternator needs to
participate in this protocol if it is to be used directly from Javascript
in browsers.
To implement CORS, Alternator needs to respond to the OPTIONS method
which it didn't allow before - with certain headers based on the
input headers. It also needs to do some of these things for the
regular methods (mostly, POST). The patch includes a comprehensive
test that runs against both Alternator and DynamoDB and shows that
Alternator handles these headers and methods the same as DynamoDB.
Additionally, I tested manually a Javascript DynamoDB client - which
didn't work prior to this patch (the browser reported CORS errors),
and works after this patch.
Fixes #8025.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.
Commit 9124a70 introduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.
This could cause the sstable writer to crash, or generate a corrupted
sstable.
Fixes #7994
Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.
The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.
This is what we get with the current weight definition:
weight=5 for sizes=[1K, 3K]
weight=6 for sizes=[4K, 15K]
weight=7 for sizes=[16K, 63K]
weight=8 for sizes=[64K, 255K]
weight=9 for sizes=[258K, 1019K]
weight=10 for sizes=[1M, 3M]
weight=11 for sizes=[4M, 15M]
weight=12 for sizes=[16M, 63M]
weight=13 for sizes=[64M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 12
Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.
To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.
Look at what we get with the new weight definition:
weight=10 for sizes=[1K, 2M]
weight=11 for sizes=[3M, 14M]
weight=12 for sizes=[15M, 62M]
weight=13 for sizes=[63M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 7
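The effect of the change can be reproduced in a few lines (assuming a fixed tax of 1 MB, which matches the table above; the exact constant in the patch may differ):

```cpp
#include <cassert>
#include <cmath>

// Old weight: log base 4 of the job's total size.
int weight_without_tax(double job_size) {
    return static_cast<int>(std::log(job_size) / std::log(4.0));
}

// New weight: a fixed tax is added to the size first, so all jobs
// smaller than ~1 MB fall into the same weight bucket.
int weight_with_tax(double job_size) {
    constexpr double tax = 1024.0 * 1024.0; // assumed 1 MB tax
    return static_cast<int>(std::log(job_size + tax) / std::log(4.0));
}
```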
Fixes #8124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
Currently, init_server and join_cluster which initiate the bootstrap and
replace operations on the new node run inside the main scheduling group.
We should run them inside the maintenance scheduling group to reduce the
impact on the user workload.
This patch fixes a scheduling group leak for bootstrap and replace operation.
Before:
[shard 0] storage_service - storage_service::bootstrap sg=main
[shard 0] repair - bootstrap_with_repair sg=main
After:
[shard 0] storage_service - storage_service::bootstrap sg=streaming
[shard 0] repair - bootstrap_with_repair sg=streaming
Fixes #8130
Closes #8131
operator<< used the wrong criterion for deciding whether the data is
stored as atomic_cell or collection_mutation, resulting in
catastrophic failure if it was used with frozen collections or UDTs.
Since frozen collections and UDTs are stored as atomic_cell, not
collection_mutation, the correct criterion is not is_collection(),
but is_multi_cell().
Closes #8134
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
The implementation is a stub for now, the service reacts to generation
changes by simply logging the event.
The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
initialization
As a preparation for introducing the CDC generation management service.
cdc_service will depend on the generation service.
But the generation service needs some other services to work
properly. In particular, it uses the local database, so it should be
initialized after the local database.
The only service that will need the cdc generation service is
storage_service, so we can place the generation service initialization
code right before storage_service initialization code. So the order will
be cdc_generation_service -> cdc_service -> storage_service.
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage, and then it's switched to std::set. Once switched, the whole
union becomes a waste of space, as its size is
sizeof(vector head) + 5 * sizeof(cell_and_hash) = 90+ bytes
and only 3 pointers from it are used (std::set header). Also the
overhead of keeping cell_and_hash as a set entry is more than the size
of the structure itself.
Column ids are 32-bit integers that most likely come sequentially.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.
This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.
The most notable result is the memory footprint decrease, for wide
rows down to 2x. The performance of micro-benchmarks is a bit
lower for small rows and (!) higher for longer ones (8+ cells). The
numbers are in patch #12 (spoiler: they are better than for v2).
v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
- changed functions not to return values via refs-arguments
- fixed nested classes to properly use language constructors
- renamed index_to to key_t to distinguish from node_index_t
- improved recurring variadic templates not to use sentinel argument
- use standard concepts
v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)
A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as the cells container most of this growth drowned in lsa alignments.
next todo:
- aarch64 version of 16-keys node search
tests: unit(dev), unit(debug for radix*), pref(dev)
"
* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
test/memory_footpring: Print radix tree node sizes
row: Remove old storages
row: Prepare row::equal for switch
row: Prepare row::difference for switch
row: Introduce radix tree storage type
row-equal: Re-declare the cells_equal lambda
test: Add tests for radix tree
utils: Compact radix tree
array-search: Add helpers to search for a byte in array
test/perf_collection: Add callback to check the speed of clone
test/perf_mutation: Add option to run with more than 1 columns
test/perf_mutation: Prepare to have several regular columns
test/perf_mutation: Use builder to build schema
In this patch, we port validation/entities/json_test.java, containing
21 tests for various JSON-related operations - SELECT JSON, INSERT JSON,
and the fromJson() and toJson() functions.
In porting these tests, I uncovered 19 (!!) previously unknown bugs in
Scylla:
Refs #7911: Failed fromJson() should result in FunctionFailure error, not
an internal error.
Refs #7912: fromJson() should allow null parameter.
Refs #7914: fromJson() integer overflow should cause an error, not silent
wrap-around.
Refs #7915: fromJson() should accept "true" and "false" also as strings.
Refs #7944: fromJson() should not accept the empty string "" as a number.
Refs #7949: fromJson() fails to set a map<ascii, int>.
Refs #7954: fromJson() fails to set null tuple elements.
Refs #7972: toJson() truncates some doubles to integers.
Refs #7988: toJson() produces invalid JSON for columns with "time" type.
Refs #7997: toJson() is missing a timezone on timestamp.
Refs #8001: Documented unit "µs" not supported for assigning a "duration"
type.
Refs #8002: toJson() of decimal type doesn't use exponents so can produce
huge output.
Refs #8077: SELECT JSON output for function invocations should be
compatible with Cassandra.
Refs #8078: SELECT JSON ignores the "AS" specification.
Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest
error, not internal error.
Refs #8086: INSERT JSON cannot handle user-defined types with case-
sensitive component names.
Refs #8087: SELECT JSON incorrectly quotes strings inside map keys.
Refs #8092: SELECT JSON missing null component after adding field to
UDT definition.
Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY.
Due to these bugs, 8 out of the 21 tests here currently xfail and one
has to be skipped (issue #8100 causes the sanitizer to detect a use
after free, and crash Scylla).
As usual in these sort of tests, all 21 tests pass when running against
Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>
When psutil.disk_partitions() reports / is /dev/root, aws_instance mistakenly
reports the root partition as part of the ephemeral disks, and RAID
construction will fail.
This prevents the error and reports the correct free disks.
Fixes #8055
Closes #8040
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.
Closes #8105
* github.com:scylladb/scylla:
cdc: log: avoid an unnecessary copy
cdc: log: fix use-after-free in process_bytes_visitor
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.
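The hazard generalizes to any small-buffer-optimized string: a view taken into a vector element dangles once the vector reallocates. A safe pattern (illustrative, not the CDC code itself) is to take views only after the owning vector has stopped growing:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// With SSO, a short std::string's characters live inside the string
// object itself, i.e. inside the vector's element storage. Any
// string_view taken before a reallocation would dangle afterwards, so
// views are created only once `owners` is fully built.
std::vector<std::string_view> views_after_building(
        const std::vector<std::string>& owners) {
    std::vector<std::string_view> views;
    views.reserve(owners.size());
    for (const auto& s : owners) {
        views.emplace_back(s);
    }
    return views;
}
```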
Fixes #8117
While a duplicate vote from the same server is not possible with a
conforming Raft implementation, Raft's assumptions about the network
permit duplicates.
So, in theory, it is possible that a vote message is delivered
multiple times.
The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.
Keep track of who has voted so far, and reject duplicate votes.
A unit test follows.
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
The leader's view of stable indexes is:
Server Match Index
A 5
B 5
C 6
D 7
E 8
The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
The old name was incorrect: when apply_snapshot() is called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().
Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.
This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.
This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.
A separate unit test for the issue will follow.
raft::log::start_idx() is currently not meaningful
in case the log is empty.
Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().
As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.
This change happens to reduce the overall line count.
While at it, improve the comments in raft::replicate_to().
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
---
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
Closes #8116
* github.com:scylladb/scylla:
tests: add a simple CDC cql pytest
cdc: add config option to disable streams rewriting
cdc: rewrite streams to the new description table
cql3: query_processor: improve internal paged query API
cdc: introduce no_generation_data_exception exception type
docs: cdc: mention system.cdc_local table
cdc: coroutinize do_update_streams_description
sys_dist_ks: split CDC streams table partitions into clustered rows
cdc: use chunked_vector for streams in streams_version
cdc: remove `streams_version::expired` field
system_distributed_keyspace: use mutation API to insert CDC streams
storage_service: don't use `sys_dist_ks` before it is started
Rewriting stream descriptions is a long, expensive, and prone-to-failure
operation. Due to #8061 it may consume a lot of memory. In general, it
may keep failing (and being retried) endlessly, straining the cluster.
As a backdoor we add this flag for potential future needs of admins or
field engineers.
I don't expect it will ever be used, but it won't hurt and may save us
some work in the worst case scenario.
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
The `query_processor::query` method allowed internal paged queries.
However, it was quite limited, hardcoding a number of parameters:
consistency level, timeout config, page size.
This commit does the following improvements:
1. Rename `query` to `query_internal` to make it obvious that this API
is supposed to be used for internal queries only
2. Extend the method to take consistency level, timeout config, and page
size as parameters
3. Remove unused overloads of `query_internal`
4. Fix a bunch of typos / grammar issues in the docstring
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
It could happen that system_distributed_keyspace was used by
storage_service before it was fully started (inside
`handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned
(on shard 0). It only checked whether `local_is_initialized()` returns
true, so it only ensured that the service is constructed.
Currently, sys_dist_ks' `start` only announces migrations, so this was
mostly harmless. More concretely: it could result in the node trying to
send CQL requests using a table that it didn't yet recognize by calling
sys_dist_ks' methods before the `announce_migration` call inside `start`
has returned. This would result in an exception; however, the exception
would be caught by the caller and the procedure would be retried,
succeeding eventually. See `handle_cdc_generation` for details.
Still, the initial intention of the code was to wait for the sys_dist_ks
service to be fully started before it was used. This commit fixes that.
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.
Misc cleanups.
* scylla-dev/raft-confchange-test:
raft: add a unit test for voting
raft: do not account for the same vote twice
raft: remove fsm::set_configuration()
raft: consistently use configuration from the log
raft: add ostream serialization for enum vote_result
raft: advance commit index right after leaving joint configuration
raft: add tracker test
raft: tidy up follower_progress API
raft: update raft::log::apply_snapshot() assert
raft: add a unit test for raft::log
raft: rename log::non_snapshoted_length() to log::length()
raft: inline raft::log::truncate_tail()
raft: ignore AppendEntries RPC with a very old term
raft: remove log::start_idx()
raft: return a correct last term on an empty log
raft: do not use raft::log::start_idx() outside raft::log()
raft: rename progress.hh to tracker.hh
raft: extend single_node_is_quiet test
Expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
That causes such sstables to not be removed from the backlog tracker,
meaning that the backlog caused by expired sstables will not be removed even
after their deletion, which means shares will be higher than needed, making
compaction potentially more aggressive than it has to be.
To fix this bug, let's manually register these sstables with the monitor,
so that they'll be removed from the tracker once compaction completes.
Fixes#6054.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
The current renaming rule for debian/scylla-* files is buggy: it fails to
install some .service files when a custom product name is specified.
Introduce regex-based rewriting instead of ad-hoc renaming, and fix the
wrong renaming rule.
Fixes #8113
Closes #8114
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader, its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.
Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler, _read_next_partition flag
on which `read_next_partition` will be called in the next iteration.
Then, read_next_partition can close _reader only before overwriting it
with a new reader. Otherwise, if _reader is always closed in the
`reader_finished` case, we end up hitting a premature end_of_stream.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
Unlike flat_mutation_reader_opt, which is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.
Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.
This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.
This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.
This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.
For more details see #7961, #7993 and #7985.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #8048
* github.com:scylladb/scylla:
cdc: Limit size of topology description
cdc: Extract create_stream_ids from topology_description_generator
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.
This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.
This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.
This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.
For more details see #7961, #7993 and #7985.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.
This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).
Fixes: #5578
Closes #8106
* github.com:scylladb/scylla:
imr: switch back to open-coded description of structures
utils: managed_bytes: add a few trivial helper methods
utils: fragment_range: move FragmentedView helpers to fragment_range.hh
utils: fragment_range: add single_fragmented_mutable_view
utils: fragment_range: implement FragmentRange for fragment_range
utils: mutable_view: add front()
types: remove an unused helper function
test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
test: mutation_test: remove an obsolete assertion
test: mutation_test: initialize an uninitialized variable
test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.
This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).
Fixes: #5578
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such a situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
However, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.
Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.
The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
While duplicate votes are not allowed by Raft rules, it is possible
that a vote message is delivered multiple times.
The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.
Keep track of who has voted so far, and reject duplicate votes.
A unit test follows.
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
Server stable indexes are:
Server Stable Index
A 5
B 5
C 6
D 7
E 8
The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
The old name was incorrect: when apply_snapshot() is called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_length(), it's a more
specific name.
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().
Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.
This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.
This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.
A separate unit test for the issue will follow.
raft::log::start_idx() is currently not meaningful
in case the log is empty.
Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().
As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.
This change happens to reduce the overall line count.
While at it, improve the comments in raft::replicate_to().
This patch adds several additional tests to test/cql-pytest/test_json.py
to reproduce additional bugs or clarify some non-bugs.
First, it adds a reproducer for issue #8087, where SELECT JSON may create
invalid JSON - because it doesn't quote a string which is part of a map's
key. As usual for these reproducers, the test passes on Cassandra, and fails
on Scylla (so marked xfail).
We have a bigger test translated from Cassandra's unit tests,
cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys
which demonstrates the same problem, but the test added in this patch is much
shorter and focuses on demonstrating exactly where the problem is.
Second, this patch adds a test that verifies that SELECT JSON works correctly
for UDTs or tuples where one of their components was never set - in such a
case the SELECT JSON should also output this component, with a "null" value.
And this test works (i.e., produces the same result in Cassandra and Scylla).
This test is interesting because it shows that issue #8092 is specific to the
case of an altered UDT, and doesn't happen for every case of null
component in a UDT.
Refs #8087
Refs #8092
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>
* seastar 76cff58964...e53a1059f9 (18):
> rpc: streaming sink: order outgoing messages
Fixes #7552.
> http: fix compilation issues when using clang++
> http/file_handler: normalize file-type for mime detection
> http/mime_types: add support for svg+xml
> reactor: simplify get_sched_stats()
> Merge "output_stream: make api noexcept" from Benny
> Merge " input_stream: make api noexcept" from Benny
> rpc: mark 'protocol' class as final
> tls: reloadable_certificate inotify flag is wrong
Fixes #8082.
> cli: Ignore the --num-io-queues option
> io_queue: Do not carry start time in lambda capture
> fstream: Cancel all IO-s on file_data_source_impl close
> http: add "Transfer-Encoding: chunked" handling
> http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked
> http: add request content streaming
> http: add reading/skipping all bytes in an input_stream
> Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov
> seastar-addr2line: split multiple addresses on the same line
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server. This is unfortunate, as protocol.hh
wasn't designed for inheritance, and the class is not marked final.
Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.
Closes#8084
Currently there are places that call
keyspace_element_name::get_keyspace() without checking that _ks_name is
engaged. Fix those places.
Message-Id: <20210216085545.54753-1-gleb@scylladb.com>
repair_writer::do_write() already has a partition comparison for each
mutation fragment written, to determine whether the fragment belongs to
another partition or not. This equality comparison can be converted to a
tri_compare at no extra cost, allowing detection of out-of-order
partitions, in which case `on_internal_error()` is called.
Refs: #7623
Refs: #7552
Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode, yet the stall detector's purpose is to detect
reactor stalls during normal operation, where they can increase
the latency of other queries running in parallel.
Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just reference
the respective github issues rather than auto-close them.
Refs #7150
Refs #5192
Refs #5960
Restore blocked_reactor_notify_ms right before
starting storage_proxy. Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.
Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
This series provides a `raft_services` class to create and store
raft schema-changes server instances, and also wires up the RPC handlers
for Raft RPC verbs.
* manmanson/raft-api-server-handlers-v10:
raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances
raft: move server address handling from `raft_rpc` to `raft_services` class
raft: wire up schema Raft RPC handlers
raft: raft_rpc: provide `update_address_mapping` and dispatcher functions
raft: pass `group_id` as an argument to raft rpc messages
raft: use a named constant for pre-defined schema raft group
Store an instance inside `raft_services` and reuse it for
all raft groups created and managed by the `raft_services` instance.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This decouples `raft_gossip_failure_detector` from being
tied to a particular rpc instance and thus makes it possible
to share the same failure detector instance among all raft servers,
since they are managed in a centralized way by a `raft_services`
instance.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch adds registration and de-registration of the
corresponding Raft RPC verbs handlers.
There is a new `raft_services` class that
is responsible for initializing the raft RPC verbs and
managing raft server instances.
The service inherits from `seastar::peering_sharded_service<T>`,
because we need to route each request to the appropriate shard,
which is handled by the `shard_for_group` function (currently it
only handles the schema raft group, which lands on shard 0, and
throws an exception otherwise).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
We had Alternator's current compatibility with DynamoDB described in
two places - alternator.md and compatibility.md. This duplication was
not only unnecessary; in some places it led to inconsistent claims.
In general, the better description was in compatibility.md, so in
this patch we remove the compatibility section from alternator.md
and instead link to compatibility.md. There was a bit of information
that was missing in compatibility.md, so this patch adds it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210215203057.1132162-1-nyh@scylladb.com>
After switching cell storage onto the compact radix tree, it
becomes useful to know the tree nodes' sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the 3rd storage type (radix tree) is all in, the old
storage can be safely removed. The result is:
1. memory footprint
sizeof(class row): 112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes
the "in cache" value depends on the number of cells:
num of cells master patch
1 752 656
2 808 712
3 864 768
4 920 824
5 968 936
6 1136 992
...
16 1840 1672
17 1904 1992 (+88)
18 1976 2048 (+72)
19 2048 2104 (+56)
20 2120 2160 (+40)
21 2184 2208 (+24)
22 2256 2264 ( +8)
23 2328 2320
...
32 2960 2808
After 32 cells the storage switches into an rbtree with
24 bytes of per-cell overhead, and the radix tree improvement
takes off:
64 7872 6056
128 15040 9512
256 29376 18568
2. perf_mutation test is enhanced by this series and the
results differ depending on the number of columns used
tps value
--column-count master patch
1 59.9k 57.6k (-3.8%)
2 59.9k 57.5k
4 59.8k 57.6k
8 57.6k 57.7k <- eq
16 56.3k 57.6k
32 53.2k 57.4k (+7.9%)
A note on this. Last time the 1-column test was ~5% worse, which
was explained by the inline storage of 5 cells that's present in the
current implementation and was absent in the radix tree.
An attempt to make inline storage for small radix trees
resulted in a complete loss of the memory footprint gain, but gave
a fraction of a percent to perf_mutation performance. So this
version doesn't have inline nodes.
The 1.2% improvement from v2 surprisingly came from
tree::clone_from(), which in v2 was worked around by a slow
walk+emplace sequence, while this version has an optimized
API call for cloning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as the previous patch, re-implement the row::equal to use
the radix_tree iterator for comparison of two index:cell sequences.
The std::equal() doesn't work here, since the predicate-fn needs
to look at both iterators to call it.key() on them (a radix tree API
feature), while std::equal provides only the T&s.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method effectively walks two pairs of <column_id, cell> and
applies the difference to a separate row instance. The code added
is a copy of the same code below this hunk with the mechanical
substitution:
c.first -> c.key()
c.second -> c->cell
it->first -> it.key()
it->second -> it.cell
because the first-s are column_id-s, reported by the radix tree
iterator's .key() method, and the second-s are cells, which were
referenced by the current code in get_..._vector() from boost::irange
and are now directly pointed to by the radix tree iterator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show how the operations
would look like. Later on vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.
NB: All the added places have indentation deliberately broken, so
that next patch will just remove the surrounding (old) code away
and (most of) the new one will happen in its place instantly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For further patching it's handy to have this helper accept
column_id and atomic_cell_or_collection arguments instead of
an std::pair of these two.
This is to facilitate next patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tree uses an integral type as the search key. On each level the local
index is the next 7 bits of the key; respectively, for a 32-bit key we have
5 levels.
The tree uses two memory-packing techniques -- prefix compaction and growing
node layouts.
Prefix compaction is used when a node has only one child. In this case
such a node is replaced in its parent with this only child, and the child in
question keeps a "prefix:length" pair on board, which is used to check
whether the short-cut lookup took the correct path.
Growing node layouts make nodes occupy only as much memory as needed
to keep the _present_ keys; there are two kinds of layout.
The direct layout is an array, so intra-node search is plain indexing. The
layout storage grows in a vector-like manner, but there's a special case
for the maximum-sized layout that helps avoid some boundary checks.
The indirect layout keeps two arrays on board -- one with values and one
with indices. Intra-node search is thus a lookup in the latter array first.
This layout is used to save memory for sparse keys. Lookup is optimized with
SIMD instructions.
Inner nodes use direct layouts, as they occupy ~1% of memory and thus
need not be memory efficient. At the same time, looking up a key in the
tree potentially walks several inner nodes, so speeding up search in
them is beneficial.
Leaf nodes are indirect, since they make up 99% of memory and thus need
to be packed well. The single indirect lookup when searching the tree
doesn't slow things down notably, even in an insertion stress test.
That said,
* inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers
* leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects
or header + 16 bytes bitmap + 128 objects
The header is
- backreference (8 bytes)
- prefix (4 bytes)
- size, layout, capacity (1 byte each)
The iterator is forward-only (for simplicity), but that is enough for its main
target -- the sparse array of cells in a row. The iterator also has an
.index() method that reports back the index of the entry at which it points.
This greatly simplifies the tree scans performed by class row.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The attrs_to_get object was previously copied, but that's quite
a heavyweight operation, since this object may contain an
instance of std::map or std::unordered_map.
To avoid copying whole maps, the object is wrapped in a shared
const pointer.
Message-Id: <75ad810de16c630b65ae8d319cb4b37e1de8085f.1613398751.git.sarna@scylladb.com>
As Gleb suggested in a previous review, remove ticker from raft and
leave calling tick() to external code.
While there, tick faster to speed up tests.
* https://github.com/alecco/scylla/tree/tests-17-remove-ticker:
raft: replication test: reduce ticker from 100ms to 1ms
raft: drop ticker from raft
Use the `_interval` member instead of the old `_range` field, but stay
compatible with pre 4.2 releases, falling back to `_range` when
`_interval` doesn't exist.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210215104008.166746-1-bdenes@scylladb.com>
The radix tree code will need to find an 8-bit value in an array
of some fixed size, so here are the helpers.
Those that allow for SIMD implementations are implemented with SIMD on x86_64.
TODO: Add aarch64 support
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some places Scylla clones collections of objects, so it's
sometimes needed to measure the speed of this operation.
This patch adds a placeholder for it, but no implementation
for any supported collection yet. One will be added soon for the
radix tree.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The --column-count option makes the test generate a schema with
the given number of columns and makes the mutation maker
fill a random column with a value on each iteration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach the schema builder and test itself to work on more
than one regular column, but for now only use 1, as before.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in all expressions' from Nadav Har'El.
This series fixes #5024, which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator. The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.
The first patch in the series fixes #8043, a bug in some error cases in
conditions, which was discovered while working on this series and is
conceptually separate from the rest of the series.
Closes #8066
* github.com:scylladb/scylla:
alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
alternator: fix bug in ReturnValues=UPDATED_NEW
alternator: implemented nested attribute paths in UpdateExpression
alternator: limit the depth of nested paths
alternator: prepare for UpdateItem nested attribute paths
alternator: overhaul ProjectionExpression hierarchy implementation
alternator: make parsed::path object printable
alternator-test: a few more ProjectionExpression conflict test cases
alternator-test: improve tests for nested attributes in UpdateExpression
alternator: support attribute paths in ConditionExpression, FilterExpression
alternator-test: improve tests for nested attributes in ConditionExpression
alternator: support attribute paths in ProjectionExpression
alternator: overhaul attrs_to_get handling
alternator-test: additional tests for attribute paths in ProjectionExpression
alternator-test: harden attribute-path tests for ProjectionExpression
alternator: fix ValidationException in FilterExpression - and more
The canonical_mutation type can contain a large mutation, particularly
when the mutation is a result of converting a big schema. Its data
was stored in a field of type 'bytes', which is non-contiguous and
may cause a large allocation.
This is fixed by simply changing the type to 'bytes_ostream', which is
fragmented. The change is compatible because the idl type 'bytes' is compatible
with 'bytes_ostream' as a result of dcf794b, and all of canonical_mutation's
methods use the field as an input stream (ser::as_input_stream), which can
be used on 'bytes_ostream' too.
Fixes #8074
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #8075
This patch adds several more tests reproducing bugs in toJson() and
SELECT JSON.
First add two xfailing tests reproducing two toJson() issues - #7988
and #8002. The first is that toJson() incorrectly formats values of the
"time" type - it should be a string but Scylla forgets the quotes.
The second is that toJson() formats "decimal" values as JSON numbers
without using an exponent, resulting in memory allocation failure
for numbers with high exponents, like 1e1000000000.
The actual test for 1e1000000000 has to be skipped because in
debug build mode we get a crash trying this huge allocation.
So instead we check 1e1000 - this generates a string of 1000
characters, which is far too long (it should just be "1e1000")
but doesn't crash.
Then we add a reproducing test for issue #8077: when using SELECT JSON
on a function, such as count(*), ttl(v) or intAsBlob(v), Cassandra has
a specific way of formatting the result in JSON, and Scylla should do
it the same way unless we have a good reason not to.
As usual, the new test passes on Cassandra and fails on Scylla, so it is
marked xfail.
Refs #7988
Refs #8002
Refs #8077.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214210727.1098388-1-nyh@scylladb.com>
Start improving CONTRIBUTING.md, as suggested in issue #8037:
1. Incorporate the few lines we had in coding-style.md into CONTRIBUTING.md.
This was mostly a pointer to Seastar's coding style anyway, so it's not
helpful to have a separate file which hopeful developers will not find
anyway.
2. Mention the Scylla developers mailing list, not just the Scylla users
mailing list. The Scylla developers mailing list is where all the action
happens, and it's very odd not to mention it.
3. The decision that GitHub pull requests are forbidden was retracted
a long time ago, so change the explanation on pull requests.
4. Some smaller phrasing changes.
Refs #8037.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214152752.1071313-1-nyh@scylladb.com>
Merged patch series from Pavel Emelyanov:
The default -O<> levels are considered to produce slow and tedious-to-test
code, so it's tempting to increase the level. On the other hand,
there were some complaints that recompile-mostly work would suffer
from slower builds.
This set tries to find a reasonable compromise -- raise the default opt
levels and provide the ability to configure them if needed.
* 'br-cxx-o-levels-2' of github.com:xemul/scylla:
configure: Switch debug build from -O0 to -Og
configure: Switch dev build from -O1 to -O2
configure: Make -O flag configurable
With the larger gap, logalloc reserved more memory for std than
the background reclaim's running threshold, so background reclaim
was triggered rarely.
With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (contrast with synchronous reclaim,
which cannot preempt until it achieves its goal).
The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.
I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.
Currently, preemption checks are at segment granularity. This is
probably too coarse and should be refined later, but it is already
better than the current granularity, which does not allow preemption
until the entire requested memory size has been reclaimed.
To speed up the replication test, reduce the tick time from 100ms to 1ms.
Speed up: debug from 3.7 to 2.5 seconds, dev from 2.9 to 2.1 seconds.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This test reproduces issue #7987, where Scylla cannot set a time column
with an integer - whereas the documentation says this should be possible
and it also works in Cassandra.
The test file includes tests for both ways of setting a time column
(using an integer and a string), with both prepared and unprepared
statements, and demonstrates that only one combination fails in Scylla -
an unprepared statement with an integer. This test xfails on Scylla
and passes on Cassandra, and the rest pass on both.
Refs #7987.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210128215103.370723-1-nyh@scylladb.com>
This patch fixes the last missing part of nested attribute support in
UpdateItem - returning the correct attributes when ReturnValues is requested.
When the expression says "a.b = :val" and ReturnValues is set to UPDATED_OLD
or UPDATED_NEW, only the actual updated attribute a.b should be returned, not
the entire top-level attribute a as we did before this patch.
This patch was made very simple because our existing hierarchy_filter()
function already does exactly the right thing, and can trivially be made to
accept any attribute_path_map<T> (in our case attribute_path_map<action>),
not just attrs_to_get as it did until now.
This patch also adds several more checks to the test in test_returnvalues.py
to improve the test's coverage even more. Interestingly, I discovered two
esoteric cases where DynamoDB does something which makes little sense, but
apparently simplified their implementation - but the beautiful thing is that
it also simplifies our implementation! See long comments about these two
cases in the test code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Commit 0c460927bf broke UpdateItem's
ReturnValues=UPDATED_NEW by moving previous_item while it is still
needed. None of the existing tests broke because none of them needed
previous_item after it was moved - but it started to break when we
add support for nested attribute paths, which need this previous_item.
So this patch reverts the move back to a copy, as it was before the
aforementioned patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds full support for nested attribute paths (e.g., a.b[3].c)
in UpdateExpression. Since previous patches already added such support
for ProjectionExpression, ConditionExpression and FilterExpression,
the nested attribute paths feature is now complete, so we
remove the warning from the documentation. However, there is one last loose
end to tie and we will do it in the next patch: After this patch, the
combination of UpdateExpression with nested attributes and ReturnValues
is still wrong, and the test for it in test_returnvalues.py still xfails.
Note that previous patches already implemented support for attribute paths
in expression evaluations - i.e., the right-hand side of UpdateExpression
actions, and in this patch we just needed to implement the left hand side:
When an update action is on an attribute a.b we need to read the entire
content of the top-level a (an RMW operation), modify just the b part of
its JSON with the result of the action, and finally write back the entire
content of a. Of course everything gets complicated by the fact that we
can have multiple actions on multiple pieces of the same JSON, and we also
need to detect overlapping and conflicting actions (we already have this
detection in the attribute_path_map<> class introduced in a previous
patch).
I decided to leave one small esoteric difference, reproduced by the xfailing
test_update_expression.py::test_nested_attribute_remove_from_missing_item:
As expected, "SET x.y = :val" fails for an item if its attribute x doesn't
exist or the item itself does not exist. For the update expression
"REMOVE x.y", DynamoDB fails if the attribute x doesn't exist, but oddly
silently passes if the entire item doesn't exist. Alternator does not
currently reproduce this oddity - it will fail this write as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB limits the depth of a nested path in expressions (e.g. "a.b.c.d")
to 32 levels. This patch adds the same limit also to Alternator.
The exact value of this limit is less important (although it did make
sense to choose the same limit as DynamoDB does), but it's important
to have *some* limit: It's often convenient to handle paths with a
recursive algorithm, and if we allow unlimited path depth, it can
result in unlimited recursion depth, and a crash. Let's avoid this
possibility.
We detect the over-long path while building the parsed::path object
in the parser, and generate a parse error.
This patch also includes a test that verifies that both Alternator
and DynamoDB have the same 32-level nesting limit on paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch prepares UpdateItem for updating of nested attribute paths
(e.g., "SET a.b = :val"), but does not yet support them.
Instead of _update_expression holding an unsorted list of "actions",
we change it to hold an attribute_path_map of actions. This will allow
us to process all the actions on a top-level attribute together, and
moreover gets us "for free" the correct checking for overlapping and
conflicting updates - exactly the same checking we already had in
attribute_path_map for ProjectionExpression. Other than this change,
most of this patch is just code movement, not functional changes.
After this patch, the tests for update path overlap and conflict pass:
test_update_expression_multi_overlap_nested and
test_update_expression_multi_conflict_nested.
We can also mark test_update_expression_nested_attribute_rhs as passing -
this test involves an attribute path in the right-hand-side of an update,
but the left-hand-side is still a top-level attribute, so it works (it
actually worked before this patch - it started working when we implemented
attribute paths in expressions, for ConditionExpression and
FilterExpression).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths grouped by the top-level
attributes, and also to detect overlapping and conflicting entries.
For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.
So in this patch we rewrite the data structure we had for ProjectionExpression
in a more generic manner, using the template attribute_path_map<T>, which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map;
it includes all the overlap and conflict detection logic.
There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.
The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already had many tests for nested attributes in UpdateExpression, but
this patch adds even more:
* Test nested attribute in right-hand-side in assignment: z = a.c.x.
* Test for making multiple changes to the same and different top-level
attributes in the same update.
* Additional cases of overlap between multiple changes.
* Tests for conflict between multiple changes.
* Tests for writing to a nested path on a non-existent attribute or item.
* A stronger test for array append sorts the added items.
As this feature was not yet implemented, these tests fail on Alternator,
and pass on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Provide several utility functions which will be used in rpc message
handlers:
1. `update_address_mapping` -- add a new (server_id -> inet_address)
mapping for a `raft_rpc` instance.
This is used to update rpc module with a caller address
upon receiving an rpc message from a yet unknown server.
2. A set of dispatcher functions for every rpc call that forward calls
to an appropriate `raft::rpc_server` instance (for which `raft::rpc`
has a back-pointer).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The reference is used by the range streamer and (!) by the storage
service itself to find out whether the consistent_rangemovement
option is ON or OFF.
Both places already have the database with its config at hand
and can be simplified.
v2: spellchecking
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.
The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in the same millisecond does it pick a higher time,
incremented by less than a millisecond.
What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
Fixes #8060
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
For certain situations where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.
For example, take servers A B C D E with only A B C active in a
partition. If the test wants to elect A, it first has to make all 3
servers reach the election timeout threshold (to make B and C receptive).
Then A is ticked until it becomes a candidate and sends vote
requests to the other servers.
But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case, with only 3 out of 5 servers
connected, a single missing vote can hang the election.
This patch disables timer ticks for all servers when running custom
elections and partitioning.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce a static `schema_raft_state_machine::group_id` constant,
which denotes the raft group id for the schema changes server.
Also fix the comment on the state machine class declaration
to emphasize that the instance will be managed by shard 0.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When a table is altered in a mixed cluster by a node with a more
recent version, the request can fail if there is a difference in
schema_features between the two versions. This mini-set handles the
two problems that prevent the sync.
Closes #8011
* github.com:scylladb/scylla:
schema: recalculate digest when computed_columns feature is enabled
schema tables: Remove mutations to unknown tables when adapting schema mutations
schema tables: Register 'scylla_tables' versions that were sent to other nodes
Whenever an alter table occurs, the mutations for the just-altered table
are sent from the coordinator over to all of the replicas.
In a mixed cluster the mutations should be adapted to a specific version
of the schema. However, the adaptation that happens today doesn't omit
mutations to newly added schema tables - specifically, mutations
to the `computed_columns` table, which doesn't exist, for example, in
version 2019.1.
This makes altering a table during a rolling upgrade from 2019.1 to
2020.1 dangerous.
In a mixed cluster there can be a situation where `scylla_tables` needs
to be sent over to another node, either because of a schema sync or because
the node pulls it when it is referenced by a frozen_mutation. The former
is not a problem, since the sending node chooses the version to send.
However, the latter is problematic, since `scylla_tables` versions are not
registered anywhere.
This patch registers every `scylla_tables` schema version that is used to
adapt mutations, since after this happens a schema pull for this version
might follow.
Added a test which measures the time it takes to replace sstables in a table's
sstable_set, using the leveled compaction strategy.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Column_family_test allows calling private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than to copy all its elements
while filtering out the ones that were erased.
The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that allows copying
without actually copying all the sstables in the set, while providing
the same methods (and some extra) without majorly decreasing their speed.
This is achieved by associating all copies with sstable_set versions
which hold the changes that were performed in them, and references to
the versions that were copied, a.k.a. their parents. The set represented
by a version is the result of combining all changes of its ancestors.
This causes most methods of a version to have a time complexity
dependent on the number of its ancestors. To limit this number, versions
that represent copies that have already been deleted are merged with
their descendants.
The strategy used for deciding when and with which of its children
should a version be merged heavily depends on the use case of sstable_sets:
there is a main copy of the set in a table class which undergoes many
insertions and deletions, and there are copies of it in compaction or
mutation readers which are further copied or edited few or zero times.
It's worth mentioning that when a copy is made, the copied set should not
be modified anymore, because that would also modify the results given by the
copy. In order to still allow modifying the copied set, if a change is
to be performed on it, the version associated with this set is replaced
with a new version depending on the previous one.
As we can see, in our use case there is a main chain of versions (with
changes from the table), and smaller branches of versions that start
from a version on this chain but are deleted soon after.
In such a case we can merge a version when it has exactly one descendant,
as this limits the number of ancestors of a version to the number of
concurrently used copies of its ancestors. During each such merge, the
parent version is removed and the child version is modified so that all
operations on it give the same results.
In order to preserve the same interface, the sstable_set still contains a
lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for
unordered_set<shared_sstable>) is now a new structure. Each sstable_set
contains an sstable_list, but not every sstable_list has to be contained
by an sstable_set, and we also want to allow fast copying of sstable_lists;
so the reference to the sstable_set_version is kept by the sstable_lists,
and the sstable_set can access the sstable_set_version it's associated
with through its sstable_list.
Accessing the sstables that are elements of a certain sstable_set copy
(i.e. select, select_sstable_runs and sstable_list's iterator) gets results
from containers that hold all sstables from all versions (stored in a
single, shared "versioned_sstable_set_data" structure), and then
filters out the sstables that aren't present in the version in question.
This version of the sstable_set allows adding and erasing the same sstable
repeatedly. Inserting and erasing from the set modifies the containers in
a version only when it has an actual effect: if an sstable has been added
in the parent version, and hasn't been erased in the child version, adding
it again will have no effect. This ensures that when merging versions, the
versions have disjoint sets of added and erased sstables (an sstable can
still be added in one and erased in the other). It's worth noting that if
an sstable has been added in one of the merged sets and erased in the
other, the version that remains after merging doesn't need to keep any
info about the sstable's inclusion in the set - it can be inferred from
the changes in previous versions (and it doesn't matter whether the
sstable was erased before or after being added).
To release pointers to sstables as soon as possible (i.e. when all references
to versions that contain them die), if an sstable is added/erased in all
child versions that are based on a version which has no external references,
this change gets removed from these versions and added to the parent version.
If an sstable's insertion gets overwritten as a result, we might be able
to remove the sstable completely from the set. We know how many times this
needs to happen by counting, for each sstable, in how many different versions
it has been added. When a change that adds an sstable gets merged with a
change that removes it, or when such a change simply gets deleted alongside
its associated version, this count is reduced; and when an sstable gets added
to a version that doesn't already contain it, this count is increased.
The methods that modify the set's contents give a strong exception guarantee
by first inserting the new sstables into their containers, and erasing them
again if an exception is caught.
Fixes#2622
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
If the range expression in a range based for loop returns a temporary,
its lifetime is extended until the end of the loop. The same can't be said
about temporaries created within the range expression. In our case,
*t->get_sstables_including_compacted_undeleted() returns a reference to a
const sstable_list, but the t->get_sstables_including_compacted_undeleted()
is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the
end of the loop, and it may be the sole owner of the referenced sstable_list,
so the referenced sstable_list may already be deleted inside the loop.
Fix by creating a local copy of the lw_shared_ptr and getting the reference
from it in the loop.
Fixes #7605
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Adding a non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs, because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.
Dispel doubts by removing the unordered_set from the constructor, and
adding a check that the sstable_set_impl is empty.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of the logical and operator (and), causing the io_properties
files to have incorrect values.
Fixes #7341
Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Closes #8019
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.
However, if there is an error in register_inactive_read,
the notification function isn't called, leaving behind
state that needs to be cleaned up.
This series separates the register_inactive_reader interface
into 2 parts:
1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and returns an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fancy at this point
that would require cleanup.
This optimizes the server when overloaded, since we do less work
that we'd need to undo in case the reader_concurrency_semaphore
runs out of resources.
2. After register_inactive_reader succeeds and returns a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the i_r_h.
After this stage, the notify_handler will be called when
the inactive_reader is evicted, for any reason.
querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.
The mutual tracking between inactive_read_handle and inactive_read
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object. The former is used to evict the reader
and to set the notify_handler and/or ttl without having to look up the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.
Test: unit(release), querier_cache_test(debug)
"
* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
querier_cache: insert_querier: ignore errors to register inactive reader
querier_cache: insert_querier: handle errors
querier_utils: mark functions noexcept
reader_concurrency_semaphore: register_inactive_read: make noexcept
reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
reader_concurrency_semaphore: inactive_read: use intrusive list
reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
reader_concurrency_semaphore: inactive_read_handle: swap definition order
reader_lifecycle_policy: retire low level try_resume method
reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
Fixes #8062
Closes #8063
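The fragile-vs-robust distinction can be illustrated with a toy check (a hypothetical sketch; the real test reads the TTL back via CQL):

```python
def fragile_check(ttl, expected):
    # Fails when the query straddles a second boundary: a row written
    # with TTL `expected` may already read back as `expected - 1`.
    return ttl == expected

def robust_check(ttl):
    # Passes as long as the row has *some* remaining TTL, which only
    # breaks if the test itself runs for ~1 million seconds.
    return ttl is not None and 0 < ttl <= 10**6

# Simulate a read landing just after a second boundary:
assert not fragile_check(1, expected=2)
assert robust_check(1)
```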
Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297
Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279.
This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client.
Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. Observed expected failures are:
`botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
`
Fixes #7294
Closes #8039
* github.com:scylladb/scylla:
alternator: server: return api_error instead of throwing
alternator: add requests_shed metrics
alternator: add handling max_concurrent_requests_per_shard
alternator: add RequestLimitExceeded error
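A minimal sketch of per-shard request shedding, assuming a simple synchronous handler (the real implementation is asynchronous C++ in the alternator server; names here are illustrative):

```python
class RequestLimitExceeded(Exception):
    pass


class Shard:
    def __init__(self, max_concurrent_requests):
        self.max_concurrent = max_concurrent_requests
        self.in_flight = 0
        self.requests_shed = 0  # exposed as a metric in the series

    def handle(self, request_fn):
        # Shed the request instead of queueing it when the shard is
        # already at its configured concurrency limit.
        if self.in_flight >= self.max_concurrent:
            self.requests_shed += 1
            raise RequestLimitExceeded(
                "too many in-flight requests: %d" % (self.in_flight + 1))
        self.in_flight += 1
        try:
            return request_fn()
        finally:
            self.in_flight -= 1
```

Shedding (erroring out) rather than queueing keeps an overloaded shard from accumulating work it cannot finish, which matches the quoted `RequestLimitExceeded` client error above.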
The code that creates system keyspaces open-codes a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
The error is benign, but if it is not handled an "unhandled exception" error
will be printed in the logs.
Message-Id: <20210209150313.GA1708015@scylladb.com>
The interface of the failure detector service is cleaned up a little:
- an unimplemented method is removed (is_alive)
- a return type of another method is fixed (arrival_samples)
- a getter for the most recent successful update is added (last_update)
This code was tested manually during various overload protection
experiments, which check if the failure detector can be used to reject
requests which have a very small chance of succeeding within their
timeout.
Closes #8052
* github.com:scylladb/scylla:
failure_detector: add getting last update time point
failure_detector: return arrival samples by const reference
failure_detector: remove unimplemented is_alive method
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)
The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.
Fixes #7595
Closes #8056
* github.com:scylladb/scylla:
tests: Adjusted tests for DC checking in NTS
locator: Check DC names in NTS
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre-fea5067df version to a post-fea5067df
one. Pre-fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mixed cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
Fixes: #8022
Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
* seastar 4c7c5c7c4...76cff5896 (6):
> rpc: Make it possible for rpc server instance to refuse connection
> reactor: expose cumulative tasks processed statistic
> fair_queue: add missing #include <optional>
> reactor: optimize need_preempt() thread-local-storage access
> Merge " Use reference for backend->reactor link" from Pavel E
> test: coroutines: failed coroutine does not throw
CQL test relied on quietly accepting non-existing DCs, so it had
to be removed. Also, one boost-test referred to the nonexistent
`datacenter2` and had to be removed.
Since the reader may normally be dropped upon
registration, hitting an error is equivalent to having
it evicted at any time, so just log the exception
and ignore it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They all are trivially noexcept.
Mark them so to simplify error handling assumptions in the
next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Catch errors from allocating an inactive_read and just log them.
Return an empty inactive_read_handle in
this case, as if the inactive reader was evicted due to
lack of resources.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Register the inactive reader first with no
evict_notify_handler and ttl.
Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to place the querier in the index and erase it
on eviction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
By default it will be unarmed and with no callback
so there's no need to wrap it in a std::optional.
This saves an allocation and another potential
error case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To simplify insertion into and eviction from the inactive_reads
container, use an intrusive list that requires a single allocation
for the inactive_read object itself.
This allows passing a reference to the inactive_read
to evict it.
Note that the reader will be unlinked automatically from
the inactive_readers list if the inactive_read_handle is destroyed.
This is okay since there is no need to track the inactive_read
if the caller loses the i_r_h (e.g. if an error is thrown).
It is also safe to evict the inactive_reader while the
i_r_h is alive. In this case the i_r will be unlinked
after the flat_mutation_reader it holds is moved out of it.
bi::auto_unlink will detect that it's already unlinked
when destroyed and do nothing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to look up the inactive_read if the i_r_h
is disengaged: the reader is not registered, so just return
quickly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's no need to hold a unique_ptr<flat_mutation_reader> as
flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl>
and functions as a unique ptr via flat_mutation_reader_opt.
With that, unregister_inactive_read was modified to return a
flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>,
keeping exactly the same semantics.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ConditionExpression in conditional updates, and FilterExpression in
queries and scans. After this patch, all previously-xfailing tests in
test_projection_expression.py and test_filter_expression.py now pass.
The fix is simple: Both ConditionExpression and FilterExpression use the
function calculate_value() to calculate the value of the expression. When
this function calculates the value of a path, it mustn't just take the
top-level attribute - it needs to walk into the specific sub-object as
specified by the attribute path.
This is not the end of attribute path support, UpdateExpression and
ReturnValues are not yet fully supported. This will come in following
patches.
Refs #5024
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
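The path-walking step that calculate_value() needs can be sketched as follows; `walk_path` and the parsed-path representation (a list of dict keys and list indices) are assumptions for illustration, not Alternator's internal types:

```python
def walk_path(value, path):
    """Given a top-level attribute's value and the rest of an attribute
    path (e.g. ['b', 2, 'c'] for a.b[2].c), descend into the nested
    document instead of stopping at the top level. Returns None when
    any step does not fit the content."""
    for step in path:
        if isinstance(step, int):
            # List index step, like a.d[3].
            if not isinstance(value, list) or step >= len(value):
                return None
            value = value[step]
        else:
            # Dictionary key step, like a.b.c.
            if not isinstance(value, dict) or step not in value:
                return None
            value = value[step]
    return value
```

A buggy implementation that "just takes the top-level attribute" would ignore `path` entirely; the loop above is the missing walk.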
Strengthen the tests in test_condition_expression.py for nested attribute
paths (e.g., b.y[1]):
1. The test test_update_condition_nested_attributes only tested successful
conditions involving nested attributes. Let's also add an *unsuccessful*
condition, to verify we don't accidentally pass every condition involving
a nested attribute.
2. Test a case where a non-existent nested attribute is involved in the
condition.
3. In the test for an attribute path with references - "#name1.#name2",
make sure the test doesn't pass if #name2 is silently ignored.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This fixes the problem where the coordinator already knows about the
new schema and issues a read which uses new objects, but the replica
doesn't know those objects yet. The read will fail in this case. We
can avoid this if we propagate schema changes with reads, like we
already do for writes.
Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
Closes #7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ProjectionExpression in the various operations where this parameter
is supported - GetItem, BatchGetItem, Query and Scan. After this patch, all
xfailing tests in test_projection_expression.py now pass.
In the previous patch we remembered in the "attrs_to_get" object not only
the top-level attributes to read from the table, but also how to filter
from it only the desired pieces of the nested document. In this patch we
add a filter() function to do this filtering, and call it in the right
places to post-process the JSON objects we read from the table.
We also had to fix reference resolution in paths to resolve all the
components of the path (e.g., #name1.#name2) and not just the top-level
attribute.
This is not the end of attribute path support, there are still other
expressions (ConditionExpression, UpdateExpression, FilterExpression,
ReturnValues) where they are not yet supported. This will come in following
patches.
Refs #5024
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
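The post-processing filter described above might look roughly like this; the `project` function and the nested-dict shape of the filter are illustrative assumptions, not Alternator's actual hierarchy_filter:

```python
def project(value, filt):
    """Keep only the parts of `value` selected by `filt`. `filt` is
    None to keep the whole sub-tree, or a dict mapping dict keys (or
    list indices) to nested filters. Returns None when the requested
    path does not fit the content."""
    if filt is None:
        return value
    if isinstance(value, dict):
        out = {}
        for key, sub in filt.items():
            if key in value:
                projected = project(value[key], sub)
                if projected is not None:
                    out[key] = projected
        return out or None
    if isinstance(value, list):
        kept = []
        for idx, sub in sorted(filt.items()):
            if isinstance(idx, int) and idx < len(value):
                projected = project(value[idx], sub)
                if projected is not None:
                    kept.append(projected)
        return kept or None
    # e.g. asking for a.b when a is a number: nothing to return.
    return None
```

This is the "filter only the requested pieces" step applied after reading the serialized top-level attribute from the table.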
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.
However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.
So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.
This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.
After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
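The overlap/conflict detection performed during that conversion can be sketched as insertion into a trie of requested paths (names and error strings below are hypothetical):

```python
def add_path(root, path):
    """Insert a parsed attribute path (e.g. ['a', 'b'] or ['a', 1]) into
    a trie of requested paths. Raises ValueError for the two error cases
    DynamoDB reports: overlapping paths (a.b together with a.b.c) and
    conflicting paths (a.b together with a[1], which no single item can
    ever satisfy)."""
    node = root
    for step in path:
        if node.get("leaf"):
            # A previously requested path ends here, so the new,
            # longer path overlaps it.
            raise ValueError("overlapping paths")
        children = node.setdefault("children", {})
        if children:
            # Children of one node must agree on dict-key vs list-index
            # access; mixing the two is a conflict.
            existing_is_index = isinstance(next(iter(children)), int)
            if existing_is_index != isinstance(step, int):
                raise ValueError("conflicting paths")
        node = children.setdefault(step, {})
    if node.get("children"):
        # The new path is a prefix of a previously requested one.
        raise ValueError("overlapping paths")
    node["leaf"] = True
```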
This patch adds more tests for attribute paths in ProjectionExpression,
that deal with document paths which do not fit the content of the item -
e.g., trying to ask for "a.b[3]" when a.b is not a list but rather an
integer or a dictionary.
Moreover, we note that if you try to ask for "a.b, a[2]", DynamoDB
fails this request as a "conflict". The reasoning is that no single
item can ever have both a.b and a[2] (the first is only valid for
dictionaries, the second for lists). It's not clear to me why we
still can't return whichever of the two actually is relevant, but
the fact is that DynamoDB does not allow it.
The new tests fail on Alternator (marked xfailed) and pass on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have 7 xfailing tests for usage of nested attribute paths (e.g.,
"a.b.c[7]") in a ProjectionExpression. But some of these tests were too
"easy" to pass - a trivial and *wrong* implementation that just ignores
the path and uses the top level attribute (in the above example, "a"),
would cause some of them to start passing.
So this patch strengthens these tests. They still pass on AWS DynamoDB,
and now continue to fail with the aforementioned broken implementation.
Refs #5024.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted in a ValidationException
error.
When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.
This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.
Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
The old duplicated code for check_BEGINS_WITH() - which had a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.
Fixes #8043
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
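The operand-origin distinction can be sketched with a toy comparison function; `check_lt` and its signature are illustrative, not the actual check_compare() API:

```python
class ValidationException(Exception):
    pass

# Types this toy comparison knows how to order.
COMPARABLE = (int, float, str, bytes)

def check_lt(a, b, a_from_query, b_from_query):
    """Compare a < b the way the message describes: a bad operand that
    came from the *query* is a ValidationException, while a bad operand
    that came from the *item* just makes the condition evaluate to
    False, so a filter silently skips the item."""
    for value, from_query in ((a, a_from_query), (b, b_from_query)):
        if not isinstance(value, COMPARABLE):
            if from_query:
                raise ValidationException("unsupported operand type")
            return False
    if isinstance(a, (int, float)) != isinstance(b, (int, float)):
        # Different types never match in a filter.
        return False
    return a < b
```

The old-syntax callers (Expected, QueryFilter) would always pass `a_from_query=False, b_from_query=True`; the new-syntax callers pass whatever the expression actually says.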
The db::update_keyspace() needs a sharded<storage_proxy>
reference, but its only caller already has one and
can pass it as an argument.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
On start, the transport controller captures the storage service
in the server config's lambda just to let the server grab a
database config option.
The same can be achieved by passing the sharded database
reference to sharded<server>::start, so that each server
instance gets the local database with config.
As a nice side effect, transport::server's config looks
more like a config with simple values and without methods
and/or lambdas on board.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-1-xemul@scylladb.com>
A few methods in the column_family API were doing the aggregation wrong,
specifically, bloom filter disk size.
The issue is not always visible, it happens when there are multiple
filter files per shard.
Fixes#4513
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #8007
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if the memtable was flushed into a single sstable, which is
the most common case. A memtable will only be flushed into more than one
sstable if TWCS is used and the memtable had old data written into it due to
out-of-order writes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
Previous patch changed the -O flag for dev builds. This
had no effect on unit tests compile+run time, and was
aimed at improving the individual tests, dtest, stress-
and other tests runtimes.
This change is mainly focused on improving the debug-mode
full unit tests running, while keeping the debuggability:
the compile+run time gets ~10 minutes shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Based on the original patch from Nadav.
The -O1-generated code is too slow. Raising the opt level
slows compilation down ~9%, but greatly improves the
testing time. E.g. running the alternator test alone is
2.5 times faster with -O2 (118 vs 48 seconds).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It was noticed that the current optimization levels do not
generate fast enough code for dev builds. On the other
hand, just increasing the default optimization level will
make recompile-mostly work much more frustrating.
The new configure.py option allows selecting the desired
-O option value by hand. The current hard-coded values are
used as defaults.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
std_list has an iterator object which provides the python3 `__next__()`
method only. Python2 wants a method called `next()`. As it is trivial to
provide both, do that to allow debugging on centos7.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>
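The fix amounts to aliasing the Python 3 iterator method under the Python 2 name, roughly:

```python
class StdListIterator:
    """Iterator usable from both Python 3 (`__next__`) and Python 2
    (`next()`), as the pretty-printer fix does to allow debugging on
    centos7 (class name and wrapped container are illustrative)."""
    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    # Python 2's iterator protocol calls next(); alias it so the same
    # class works under both interpreters.
    next = __next__
```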
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.
The fix is to lookup the object one more time after maybe_commit().
Throwing a C++ exception creates unnecessary overhead, so when
an unsupported operation is encountered, the api error is directly
returned instead of being thrown.
The config value is already used to set an upper limit of concurrent
CQL requests, and now alternator abides by it as well.
Excessive requests result in returning RequestLimitExceeded error
to the client.
Tests: manual
Running multiple concurrent requests via the test suite results in:
botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache the information needed to create them
(temporary_buffer<>). This solution takes that approach.
This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index
entry is parsed, but it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.
Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.
Load driver command line:
./scylla-bench \
-workload uniform \
-mode read \
--partition-count=10 \
-clustering-row-count=1 \
-concurrency 100
Scylla command line:
scylla --developer-mode=1 -c1 -m1G --enable-cache=0
The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.
Before:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 4706 4706 0 35ms 30ms 27ms 25ms 24ms 21ms 21ms
2s 4646 4646 0 42ms 31ms 31ms 27ms 25ms 21ms 22ms
3.1s 4670 4670 0 40ms 27ms 26ms 25ms 25ms 21ms 21ms
4.1s 4581 4581 0 39ms 33ms 33ms 27ms 26ms 21ms 22ms
5.1s 4345 4345 0 40ms 37ms 35ms 32ms 31ms 21ms 23ms
6.1s 4328 4328 0 49ms 40ms 34ms 32ms 31ms 22ms 23ms
7.1s 4198 4198 0 45ms 36ms 35ms 31ms 30ms 22ms 24ms
8.2s 3913 3913 0 51ms 50ms 50ms 39ms 35ms 24ms 26ms
9.2s 4524 4524 0 34ms 31ms 30ms 28ms 27ms 21ms 22ms
After:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 7913 7913 0 25ms 25ms 20ms 15ms 14ms 12ms 13ms
2s 7913 7913 0 18ms 18ms 18ms 16ms 14ms 12ms 13ms
3s 8125 8125 0 20ms 20ms 17ms 15ms 14ms 12ms 12ms
4s 5609 5609 0 41ms 35ms 29ms 28ms 27ms 13ms 18ms
5.1s 8020 8020 0 18ms 17ms 17ms 15ms 14ms 12ms 13ms
6.1s 7102 7102 0 27ms 27ms 24ms 19ms 18ms 13ms 14ms
7.1s 5780 5780 0 26ms 26ms 26ms 23ms 22ms 17ms 18ms
8.1s 6530 6530 0 37ms 34ms 26ms 22ms 20ms 15ms 15ms
9.1s 7937 7937 0 19ms 19ms 17ms 17ms 16ms 12ms 13ms
Tests:
- unit [release]
- scylla-bench
"
* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
sstables: Share partition index pages between readers
sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
sstables: index_reader: Do not store cluster index cursor inside partition indexes
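A toy model of the reference-counted sharing described above, where a page lives only while at least one reader holds it (illustrative Python, not the sstable code):

```python
class SharedPageCache:
    """Index pages are shared between concurrent readers; a page is
    loaded at most once while any reader references it, and dropped
    when the last reference goes away (the series notes a future
    version would instead keep pages until memory pressure)."""
    def __init__(self):
        self._pages = {}  # page key -> [refcount, page data]
        self.loads = 0    # counts the I/Os we would have issued

    def get(self, key, load):
        entry = self._pages.get(key)
        if entry is None:
            self.loads += 1
            entry = self._pages[key] = [0, load()]
        entry[0] += 1
        return entry[1]

    def put(self, key):
        entry = self._pages[key]
        entry[0] -= 1
        if entry[0] == 0:
            # Last reader released the page: drop it.
            del self._pages[key]
```

With two concurrent readers hitting the same partition index page, the second `get` is satisfied from the cache, which is the "2 I/O per read, now 1 (amortized)" effect measured above.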
Our test for tracing Alternator requests can't be sure when tracing a request
finished, because tracing is asynchronous and has no official ending signal.
So before we can conclude that tracing failed, we need to wait until a
timeout, which in the current code was roughly 6.4 seconds (the timeout
logic is unnecessarily convoluted, but to make a long story short it has
exponential sleeps starting with 0.1 second and ending with 3.2 seconds,
totaling 6.4 seconds).
It turns out that sporadically, in test runs on overcommitted test machines
with the very slow debug build, we fail this test with this timeout.
So this patch increases the timeout to 51.2 seconds. It should be more
than enough for everyone. Famous last words :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210204151554.582260-1-nyh@scylladb.com>
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
This change is also needed before we can implement promoted index caching.
That's because before the promoted index can be shared by readers, the
partition index entries, which hold the promoted index, must also be
shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index entry
is parsed, but it's created on-demand when the top-level cursor enters
the partition. The promoted index cursor is owned by the top-level cursor,
not by the partition index entry.
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).
In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
The current procedure for building images is complicated, as it
requires access to x86_64, aarch64, and s390x machines. Add an alternative
procedure that is fully automated, as it relies on emulation on a single
machine.
It is slow, but requires less attention.
Closes #8024
The supplementary groups are removed by default, so add them back.
Supplementary groups are useful for group-shared directories like
ccache.
I added them to the podman-only branch since I don't know if this
works for docker. If a docker user verifies it works there too,
we can move it to the generic code.
Closes #8020
The patch adds a set of counters for various events inside the raft
implementation to facilitate monitoring and debugging.
Message-Id: <20210204125313.GA1513786@scylladb.com>
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.
We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).
Fixes #8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.
Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.
* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
raft: introduce `raft_rpc` class
raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
configure.py: compile serializer.cc
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks. When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```
This change reverses the order and first checks that we found
enough disks before getting the first disk size.
Fixes #7971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #8027
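The fix is the classic check-before-index pattern; a hypothetical sketch of the reversed order (function shape is illustrative, not the actual scylla_util.py code):

```python
def is_recommended_instance(ephemeral_disks):
    """Verify we have at least one ephemeral disk *before* reading the
    first disk's size, so an instance without NVMe disks gets a clear
    False instead of the IndexError from #7971."""
    if len(ephemeral_disks) < 1:
        return False
    first_disk_size = ephemeral_disks[0]  # safe: the list is non-empty
    return first_disk_size > 0
```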
`mutation::consume()` is used by range scans to convert the immediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, and consequently support for consuming clustering
fragments in reverse order.
Fixes: #8000
Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.
Tests: unit(release, debug:v1)
"
* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: store inactive readers directly
querier_cache: store readers in the reader concurrency semaphore directly
querier_cache: retire memory based cache eviction
querier_cache: delegate expiry to the reader_concurrency_semaphore
reader_concurrency_semaphore: introduce ttl for inactive reads
querier_cache: use new eviction notify mechanism to maintain stats
reader_concurrency_semaphore: add eviction notification facility
reader_concurrency_semaphore: extract evict code into method evict()
This is the continuation of the row-cache performance
improvements, this time -- the rework of clustering keys part.
The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree
Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:
There's no copyable key at hand. The clustering key is a
managed_bytes() that is not nothrow-copy-constructible, nor is
it hashable for lookup due to prefix lookup.
Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.
B-trees are like B+ trees, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger than B+ ones, which
keep data only in leaf nodes. Not to make the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.
Not to invalidate iterators on insert/remove the tree nodes keep
pointers on keys, not the keys themselves.
The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% less comparisons on random
insert/lookup test.
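A sketch of why a tri-comparator saves comparisons, using a plain sorted vector and a hypothetical `tri_lower_bound` (not the actual B-tree code): a single three-way comparison both steers the binary search and records whether the bound is an exact hit, so no separate equality check is needed afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Result of a tri-compare lower bound: the position plus whether the
// element at that position compared equal to the key.
struct lb_result {
    size_t index;
    bool match;
};

lb_result tri_lower_bound(const std::vector<int>& v, int key,
                          const std::function<int(int, int)>& tri_cmp) {
    size_t lo = 0, hi = v.size();
    bool match = false;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = tri_cmp(v[mid], key);  // <0, 0 or >0 -- one call, full answer
        if (c < 0) {
            lo = mid + 1;
        } else {
            if (c == 0) {
                match = true;  // remember the hit while narrowing to the first equal
            }
            hi = mid;
        }
    }
    return {lo, match};
}
```

With a less-comparator the caller would have to do one extra comparison after the search just to learn whether the bound is a match.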
Numbers:
- memory_footprint: B-tree master
rows_entry size: 216 232
1 row
in-cache: 968 960 (because of dummy entry)
in-memtable: 1006 1022
100 rows
in-cache: 50774 50856
in-memtable: 50620 50918
- mutation_test: B-tree master
tps.average: 891177 833896
- simple_query: B-tree master
tps.median: 71807 71656
tps.maximum: 71847 71708
* xemul/clustering-cache-over-btree-4:
mutation_partition: Save one keys comparison
partition_snapshot_row_cursor: Remove rows pointer
mutation_partition: Use B-tree insertion sugar
perf-test : Print B-tree sizes
mutation_partition: Switch cache of rows onto B-tree
partition_snapshot_reader: Rename cmp to less for explicity
mutation_partition: Make insertion bullet-proof
mutation_partition: Use tri-compare in non-set places
flat_mutation_reader: Use clear() in destroy_current_mutation()
rows_entry: Generalize compare
utils: Intrusive B-tree (with tests)
tests: Generalize bptree compaction test
tests: Generalize bptree stress test
Fixes#7615
Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.
Small test included.
Closes#7616
* github.com:scylladb/scylla:
commitlog_test: Add multi-entry write test
commitlog: Add "add_entries" call to allow inputting N mutations
commitlog: Make commitlog entries optionally multi-entry
commitlog: Move entry_writer definition to cc file
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.
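The added condition can be sketched like this (illustrative names, not the actual sstable reader API):

```cpp
#include <cassert>

// Mirror of the state machine's decision: proceed::no pauses it and lets
// the consumer (or the reactor) run.
enum class proceed { no, yes };

struct reader_state {
    bool buffer_full = false;
    bool preempt_requested = false;
};

// Previously only buffer_full was checked after pushing a fragment; now a
// preemption request also returns proceed::no, dropping back to the outer
// fill_buffer() loop which yields to the reactor.
proceed after_push(const reader_state& st) {
    if (st.buffer_full || st.preempt_requested) {
        return proceed::no;
    }
    return proceed::yes;
}
```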
Unlike the previous attempt, push_ready_fragments() is not touched.
The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
Test: unit (dev, debug, release)
Fixes#7883.
Closes#7928
* github.com:scylladb/scylla:
sstable: reader: preempt after every fragment
clustering_range_walker: fix false discontiguity detected after a static row
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.
Add unit test that reproduces this scenario
and verifies the internal error exception.
Fixes#7977
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
Fixes#7615
Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to the segment, or none.
Returns rp_handle vector corresponding to the call vector.
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.
To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the extra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.
When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to the caller.
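A toy sketch of the "checksum the individual checksums" scheme (the real commitlog uses CRC32 and a versioned header layout; `toy_checksum` and the functions below are purely illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simple rolling checksum, a stand-in for the real CRC.
uint32_t toy_checksum(const std::string& data) {
    uint32_t sum = 0;
    for (unsigned char c : data) {
        sum = sum * 31 + c;
    }
    return sum;
}

// Seal a multi-entry unit: each sub-entry carries its own checksum, and a
// trailing "post-word" checksums the sequence of sub-entry checksums.
std::pair<std::vector<uint32_t>, uint32_t> seal(const std::vector<std::string>& entries) {
    std::vector<uint32_t> sums;
    std::string sums_blob;
    for (const auto& e : entries) {
        sums.push_back(toy_checksum(e));
        sums_blob.append(reinterpret_cast<const char*>(&sums.back()), sizeof(uint32_t));
    }
    return {sums, toy_checksum(sums_blob)};
}

// On replay: validate every sub-entry and the post-word before handing
// anything to the caller, so the whole unit is accepted or rejected together.
bool verify_multi_entry(const std::vector<std::string>& entries,
                        const std::vector<uint32_t>& entry_sums,
                        uint32_t post_word) {
    if (entries.size() != entry_sums.size()) {
        return false;
    }
    std::string sums_blob;
    for (size_t i = 0; i < entries.size(); ++i) {
        if (toy_checksum(entries[i]) != entry_sums[i]) {
            return false;
        }
        sums_blob.append(reinterpret_cast<const char*>(&entry_sums[i]), sizeof(uint32_t));
    }
    return toy_checksum(sums_blob) == post_word;
}

bool demo_ok() {
    auto [sums, post] = seal({"a", "bb"});
    return verify_multi_entry({"a", "bb"}, sums, post) &&
           !verify_multi_entry({"a", "bc"}, sums, post);  // tampered entry fails
}
```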
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.
See scylladb/scylla-jmx#94
Closes#8005
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)
The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().
A get_origin() method was added to class sstable to retrieve
its origin. It returns an empty string by default if the origin
is missing.
Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
it generates. Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.
A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.
Test: unit(release)
Fixes#7880
"
* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
compaction: log sstable origin
sstables: scylla_metadata: add support for sstable_origin
sstables: sstable_writer_config: add origin member
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.
This 2nd comparison can be avoided:
- the 1st case needs a B-tree lower_bound API extension that reports
if the bound is match or not.
- the 2nd one is covered with reusing tri-compare result from the
1st comparison
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the switch from BST to B-tree the memory footprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The switch is pretty straightforward, and consists of
- change less-compare into tri-compare
- rename insert/insert_check into insert_before_hint
- use tree::key_grabber in mutation_partition::apply_monotonically to
exception-safely transfer a row from one tree to another
- explicitly erase the row from tree in rows_entry::on_evicted, there's
a O(1) tree::iterator method for this
- rewrite rows_entry -> cache_entry transformation in the on_evicted to
fit the B-tree API
- include the B-tree's external memory usage into stats
That's it. The number of keys per node is set to 12 with linear search
and a linear extension of 20 because
- experimenting with tree shows that numbers 8 through 10 keys with linear
search show the best performance on stress tests for insert/find-s of
keys that are memcmp-able arrays of bytes (which is an approximation of
current clustering key compare). More keys work slower, but still better
than any bigger value with any type of search up to 64 keys per node
- having 12 keys per nodes is the threshold at which the memory footprint
for B-tree becomes smaller than for boost::intrusive::set for partitions
with 32+ keys
- 20 keys for linear root eats the first-split peak and still performs
well in linear search
As a result the footprint for the B-tree is bigger than the one for BST only for
trees filled with 21...32 keys, by 0.1...0.7 bytes per key.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The boost::intrusive::set insert-s are non-throwing, so it's safe to add
new entry like this
auto* ne = new entry;
set.insert(ne);
and not worry about memory leak. B-tree's insert will be throwing, so we
need some way to free the new entries in case of an exception. There's already
a way for this:
std::unique_ptr<entry> ne = std::make_unique<entry>();
set.insert(*ne);
ne.release();
so make every insertion into the set work this way in advance.
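A runnable illustration of that pattern, using a `std::set` of pointers as a stand-in for the throwing (intrusive) B-tree insert:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>

struct entry {
    int key;
    explicit entry(int k) : key(k) {}
};

struct cmp {
    bool operator()(const entry* a, const entry* b) const { return a->key < b->key; }
};

// Insert a new entry; if the container's insert throws (e.g. bad_alloc in a
// real tree), the unique_ptr still owns the entry and frees it -- no leak.
entry* safe_insert(std::set<entry*, cmp>& s, int key) {
    std::unique_ptr<entry> ne = std::make_unique<entry>(key);
    s.insert(ne.get());   // may throw
    return ne.release();  // give up ownership only after insertion succeeded
}

size_t demo() {
    std::set<entry*, cmp> s;
    safe_insert(s, 1);
    safe_insert(s, 2);
    size_t n = s.size();
    for (entry* e : s) {
        delete e;
    }
    return n;
}
```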
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The mutation_partition::_rows will be switched to a B-tree with a tri
comparator, so to clearly identify the places not affected by it, switch
them to tri-compare in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code uses a loop of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of the rows_entry tri-comparator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The design of the tree goes from the row-cache needs, which are
1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible
With the above the design is
Two types of nodes -- inner and leaf. Both types keep a pointer to the parent
node and N pointers to keys (not the keys themselves). Two differences: inner
nodes have an array of pointers to children, leaf nodes keep a pointer to the
tree (to update the left- and rightmost tree pointers on node move).
Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.
In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().
Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.
The code was copied from the B+ tree one, then heavily reworked, the internal
algorithms turned out to differ quite significantly.
For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.
As measured by the perf_collections test the B-tree with 8 keys is faster than
the std::set, but slower than the B+ tree:
vs set vs b+tree
fill: +13% -6%
find: +23% -35%
Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
less pointers to set-up and less instructions to execute when linking
node with root).
v4:
- equip insertion methods with on_alloc_point() calls to catch
potential exception guarantees violations earlier
- add unlink_leftmost_without_rebalance. The method is borrowed from
boost intrusive set, and is added to kill two birds -- provide it,
as it turns out to be popular, and use a bit faster step-by-step
tree destruction than plain begin+erase loop
v3:
- introduce "inline" root node that is embedded into tree object and in
which the 1st key is inserted. This greatly improves the 1-key-tree
performance, which is pretty common case for rows cache
v2:
- introduce "linear" root leaf that grows on demand
This improves the memory consumption for small trees. This linear node may
and should over-grow the NodeSize parameter. This comes from the fact that
there are two big per-key memory spikes on small trees -- 1-key root leaf
and the first split, when the tree becomes 1-key root with two half-filled
leaves. If the linear extension goes above NodeSize it can flatten even the
2nd peak
- mitigate the keys indirection a bit
Prefetching the keys while doing the intra-node linear scan and the nodes
while descending the tree gives ~+5% of fill and find
- generalize stress tests for B and B+ trees
- cosmetic changes
TODO:
- fix a few inefficiencies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes that are not left-/rightmost not to carry
unused tree pointer on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.
Unlike the previous attempt, push_ready_fragments() is not touched.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
Test: unit (dev)
Fixes#7883.
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.
This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.
A unit test is added.
Ref #7883.
The database class has two duplicated functions, keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.
Message-Id: <20210201135333.GA1403508@scylladb.com>
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
Misc cleanups.
* scylla-dev/raft-config-changes-v2:
raft: update README.md
raft: add a simple test for configuration changes
raft: joint consensus, wire up configuration changes in the API
raft: joint consensus, count votes using joint config
raft: joint consensus, wire up configuration changes in FSM
raft: joint consensus, update progress tracker with joint configuration
raft: joint consensus, don't store configuration in FSM
raft: joint consensus, keep track of the last confchange index in the log
raft: joint consensus, implement helpers in class configuration
raft: joint consensus, use unordered_set for server_address list
raft: joint consensus, switch configuration to joint
raft: rename check_committed() to maybe_commit()
raft: fix spelling and add comments
Add new scylla_metadata_type::SSTableOrigin.
Store and retrieve an sstring in the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.
Add unit test to verify the sstable_origin extension
using both empty and a random string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.
The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.
Refs: #8001.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself and specify --keep-umask
option.
Fixes #6243
Closes #6244
* github.com:scylladb/scylla:
dist/offline_redhat: fix umask error
dist/offline_installer/redhat: support cross build
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself and specify --keep-umask
option.
Fixes#6243
Supported cross build by running CentOS7 on docker, now it's able to build
on Fedora.
It also supported switch container image, tested on Oracle Linux 7 and
CentOS 7/8.
* seastar 52d41277a...cb3aaf07e (2):
> tls: reloadable_credentials_base: add_dir_watch: fix root dir detection
> scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
The last fragment is unconditionally pushed to the set of fragments, so if the
data size is fragment-aligned, an empty fragment will be needlessly pushed to
the back of the fragment set.
Note: I haven't tested whether an empty fragment at the back of the set will
cause issues; I think it won't, but this should be avoided anyway.
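The intended arithmetic can be sketched as below (`default_fragment_size` and `fragment_sizes` are illustrative names): the remainder-sized fragment is appended only when the remainder is non-zero, so aligned sizes produce no empty trailing fragment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the buffer's configured maximum fragment size.
constexpr size_t default_fragment_size = 128 * 1024;

std::vector<size_t> fragment_sizes(size_t data_size,
                                   size_t fragment_size = default_fragment_size) {
    // Full fragments first...
    std::vector<size_t> sizes(data_size / fragment_size, fragment_size);
    // ...then the last, non-full fragment -- skipped when size is aligned,
    // which is exactly the case that used to produce an empty fragment.
    size_t last = data_size % fragment_size;
    if (last != 0) {
        sizes.push_back(last);
    }
    return sizes;
}
```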
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>
1) reuse default_fragment_size for knowledge of max fragment size
2) fragments_count is not a good name as it doesn't include last non-full
fragment (if present), so rename it.
3) simplify calculation of last fragment size
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>
The patch contains a skeleton implementation for the Scylla-specific
Raft RPC module.
It uses `netw::messaging_service` as underlying mechanism to send
RPC messages.
The instance is supposed to be bound to a single raft group.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.
`send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when snapshot transfer
finishes or fails with an exception.
All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This file was not added to the configure.py,
which `raft_sys_table_storage` series was supposed to do.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Send RequestVote to a joint config.
We need to exclude self from the list of peers
if we're not part of the current configuration.
Avoid disrupting the cluster in this case.
Maintain separate status for previous and current config when counting
votes.
When add_entry() with a new configuration is submitted,
create a joint configuration and switch to it immediately.
Refuse to enter joint configuration if a configuration
change is already in progress.
When the leader has committed an entry with the joint configuration,
append a new entry with final configuration and switch to it.
Resign leadership if the current leader is not part of a new
configuration.
When we change from A, B, C to B, C, D and the leader is A,
then, when C_new starts to be used, the leader is not part of
the current configuration, so it doesn't have to be in the tracker.
Do not try to find & advance leader progress unconditionally then.
The leader doesn't have to be part of the current
configuration, so add a way to access follower_progress for the leader
only if it is present.
Upon configuration changes, preserve progress information
for intact nodes, remove for removed, and create a new progress
object for added nodes.
When tracking commit progress in joint configuration mode,
calculate two commit indexes for two configurations, and
choose the smallest one.
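A sketch of that rule with hypothetical helpers (real Raft tracks per-follower match indexes in the progress tracker): commit under joint consensus requires a majority in each configuration, hence the minimum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// match_indexes holds, per server of one configuration, the highest log
// index known to be replicated on that server.
uint64_t majority_commit_index(std::vector<uint64_t> match_indexes) {
    // Sort descending; the element at position floor(n/2) is replicated on
    // a strict majority (that element plus everything sorted before it).
    std::sort(match_indexes.begin(), match_indexes.end(), std::greater<uint64_t>());
    return match_indexes[match_indexes.size() / 2];
}

// An entry is committed only when majorities of both C_old and C_new have it.
uint64_t joint_commit_index(const std::vector<uint64_t>& c_old,
                            const std::vector<uint64_t>& c_new) {
    return std::min(majority_commit_index(c_old), majority_commit_index(c_new));
}
```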
In follower state, FSM doesn't know the current cluster
configuration. Instead of trying to watch the follower log for
configuration changes to keep FSM copy up to date, remove it from
FSM altogether since the follower doesn't need it anyway.
When entering candidate or leader state, fetch the most recent
configuration from the log and initialize the state specific
state with it.
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.
We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
helps avoid log backward scan on truncation in 100% of cases.
Remove all unused log constructors.
In order to work correctly in transitional configuration,
participants must enter it after crashes, restarts and
state changes.
This means it must be stored in Raft log and snapshot
on the leader and followers.
This is most easily done if transitional configuration
is just a flavour of standard configuration.
In FSM, rename _current_config to _configuration,
it now contains both current and future configuration
at all times.
The idea of the monotonicity checking test is: try to apply
one random partition to another random one, sequentially
failing allocations. Each time allocation fails (with the
bad_alloc exception) -- check the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we make sure, that an
exception may pop up at any point of application and it
will be safe.
This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.
Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.
"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.
The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.
The raft table stores only the latest snapshot id, while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.
IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for the persistency module and for the future RPC implementation.
The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking `deserialize` function, which is
added in the patch.
The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.
There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.
The patch-set also contains the test suite covering basic
functionality of the persistency module.
* manmanson/raft-api-impl-v11:
raft/sys_table_storage: add basic tests for raft_sys_table_storage
raft: introduce `raft_sys_table_storage` class
utils: add `fragmented_temporary_buffer::allocate_to_fit`
raft: add IDL definitions for raft types
raft: create `system.raft` and `system.raft_snapshots` tables
serializer: add `serializer<lw_shared_ptr<T>>` specialization
serializer: add `deserialize` function overload for `bytes_ostream`
The test suite covers the most basic use cases for the system table
backed raft persistency module:
* store/load vote and term
* store/load snapshot
* store snapshot with log tail truncation
* store/load log entries
* log truncation
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is the implementation of raft persistency module that
uses `raft` system table as the underlying storage model.
The instance is supposed to be bound to a single raft group.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce `fragmented_temporary_buffer::allocate_to_fit` static
function returning an instance of the buffer of a specified size.
The allocated buffer fragments have a size of at most 128kb.
`bytes_ostream` has the same hard-coded limit, so just use the
same here.
This patch will be later needed for `raft::log_entry` raw data
serialization when writing to the underlying persistent storage.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Changes to the `configuration` and `tagged_uint64` classes are needed
to overcome limitations of the IDL compiler tool, i.e. we need to
supply a constructor to the struct initializing all the
members (raft::configuration) and also need to make an accessor
function for private members (in case of raft::tagged_uint64).
All other structs mirror raft definitions in exactly the same way
they are declared in `raft.hh`.
`tagged_id` and `tagged_uint64` are used directly instead of their
typedef-ed companions defined in `raft.hh` since we don't want
to introduce indirect dependencies. In such case it can be guaranteed
that no accidental changes made outside of the idl file will affect idl
definitions.
This patch also fixes a minor typo in `snapshot_id_tag` struct used
in `snapshot_id` typedef.
System raft table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.
The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.
The raft table stores only the latest snapshot id, while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This one works similarly to `serializer<optional<T>>` and will be
later needed for serializing `raft::append_request`, which has
a field containing `lw_shared_ptr`.
Users be warned, though: this code assumes that the pointer
is never null. This is done to mirror the serialize implementation
for `lw_shared_ptr:s` in the messaging_service.cc, which is
subject to being deleted in favor of the impl in the
`serializer_impl.hh`.
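The shape of such a serializer can be sketched with `std::shared_ptr` and iostreams as stand-ins (the real code works on `lw_shared_ptr` and Scylla's serializer framework); note the same non-null assumption:

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Serialize through the pointer as if it were the plain value.
// Undefined behaviour if p is null -- that is the caller's contract.
template <typename T>
void serialize(std::ostream& out, const std::shared_ptr<T>& p) {
    out << *p;
}

// Deserialize the plain value and wrap it back into a shared pointer.
template <typename T>
std::shared_ptr<T> deserialize(std::istream& in) {
    T v{};
    in >> v;
    return std::make_shared<T>(std::move(v));
}

int roundtrip(int value) {
    std::stringstream ss;
    serialize(ss, std::make_shared<int>(value));
    return *deserialize<int>(ss);
}
```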
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.
To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.
A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).
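A sketch of the special-casing (illustrative names; the real implementation installs a sharder object on the schema rather than branching per call):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Schema tables: everything under system_schema, plus
// system.scylla_table_schema_history.
bool is_schema_table(const std::string& ks, const std::string& cf) {
    return ks == "system_schema" ||
           (ks == "system" && cf == "scylla_table_schema_history");
}

unsigned shard_for(const std::string& ks, const std::string& cf,
                   uint64_t token, unsigned shard_count) {
    if (is_schema_table(ks, cf)) {
        return 0;  // all schema data lives on shard 0
    }
    // Stand-in for the real token-based sharding of ordinary tables.
    return static_cast<unsigned>(token % shard_count);
}
```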
Closes#7947
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses
This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.
This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.
One very sad discovery I made while working on this patch-set is that
we
still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in the
`mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.
Tests: unit(release, debug:v3)
"
* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
cql_query_test: add unit test covering the non-optimal TWCS sstable read path
sstable_mutation_reader: consolidate constructors
tests: don't pass temporary ranges to readers
sstables: sstable_mutation_reader: remove now unused whole sstable constructor
sstables: stats: remove now unused sstable_partition_reads counter
sstable: remove read_.*row.*_flat() methods
tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
sstables: pass partition_range to create_single_key_sstable_reader()
sstables: sstable: add make_reader()
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.
* seastar a287bb1a3...52d41277a (8):
> fair_queue: Preempted requests got re-queued too far
> scripts/perftune.py: remove repeated items after merging options from file
> file.hh: Remove fair_queue.hh
> Merge "Reloadable TLS certificate tolerance" from Calle
> Merge "Cancellable IO" from Pavel E
> abort-source: Improve the subscriptions management
> fair_queue: Improve requests preemption while in pending state
> http: add support for Default handler (/*)
The two remaining sstable constructors are very similar apart from the
content of the initialize lambda. Speaking of which, the two remaining
initializer lambdas can be easily merged into one too. So this patch
does just that: it consolidates the two constructors into one and
extracts the merged initializer lambda into a member method. This means
we have to store the previously captured variables as
members, but this is actually a good thing: when debugging we can see
the range and slice the reader is reading, and we are not actually
paying for it either -- they were already stored, just out of sight.
The sstable_mutation_reader, like all other mutation readers expects
that the partition-range passed to it is kept alive by its creator
for the duration of its lifetime. However, the single-key constructor
of the sstable reader was more tolerant, as it only extracted the key
from the range, essentially requiring only the key to be kept alive (but
not the containing range). Naturally, in time some code came to rely on
it and ended up passing temporary ranges to the reader. This behaviour
will no longer be acceptable as we are about to consolidate the various
sstable reader constructors, uniformly requiring that the range is kept
alive. So this patch fixes up the tests so they work with this stricter
requirement. Only two occurrences were found.
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.
This in effect reverts 68663d0de.
Follow-up to #7917
The size of a cf::column_family_info is 224 bytes, so a std::vector that
contains one for each column family may be very large, causing allocations
of over 1MB.
Considering the vector is used only for iteration, it can be changed to
a non-contiguous list instead.
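The allocation-size arithmetic can be checked directly (a sketch with a hypothetical stand-in struct, not the real cf::column_family_info):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cf::column_family_info: 224 bytes per entry,
// as stated above.
struct column_family_info_like { char payload[224]; };

// A contiguous std::vector of n entries needs a single allocation of at
// least n * 224 bytes, so a few thousand column families already push
// one allocation past 1MB; a non-contiguous list allocates each node
// separately instead.
constexpr std::size_t contiguous_bytes(std::size_t n) {
    return n * sizeof(column_family_info_like);
}
```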
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #7973
"
Currently we have two parallel query paths:
* database::query() -> table::query() -> data_query()
* mutation::query()
The former is used by single partition queries, the latter by range
scans, as mutation::query() is used to convert reconcilable_result to
query::result (which means it is also used in single partition queries
if it triggers read repair). This is a rather unfortunate situation as
we have two parallel implementations of the query code, which means they
are prone to diverge, and in fact they already have -- more on that
later.
This patchset aims to remedy this situation by retiring
`mutation::query()` and migrating users to an implementation based on
the "standard" query path, in other words one using the same building
blocks as the `database::query()` path. This means using
`compact_mutation` for compacting and `query_result_builder` for result
building. These components, however, were created to work with
`flat_mutation_reader`, and introducing a reader into this pipeline
would mean that we'd have to make all the related APIs asynchronous,
which would cause an insane amount of churn. To avoid this, this
patchset adds an API compatible `consume()` method to `mutation`, which
can accept a `compact_mutation` instance as-is. This allows an elegant
and succinct reimplementation. So far so good.
Like mentioned above, the two implementations have diverged in time, or
have been different from the start. The difference manifests when
calculating digests, more precisely in which tombstones are included in
the digest. The retired `mutation::query()` path incorporates only
non-purgeable tombstones in the digest. The standard query path however
incorporates all tombstones, even those that can be purged. After some
scrutiny, however, this difference proved to be completely theoretical, as
the code path where this would matter -- converting reconcilable result
to query result -- passes min timestamp as the query time to the
compaction, so nothing is compacted and hence the difference has no
chance to manifest.
This patch-set was motivated by the desire to provide a single solution
to #7434, instead of two, one for each path.
Tests: unit(release:v2, debug:v2, dev:v3)
"
* 'unified-query-path/v3' of https://github.com/denesb/scylla:
mutation: remove now unused query() and query_compacted()
treewide: use query_mutations() instead of mutation::query()
mutation_test: test_query_digest: ensure digest is produced consistently
mutation_query: introduce query_mutation()
mutation_query: to_data_query_result(): migrate to standard query code
mutation_query: move to_data_query_result() to mutation_partition.cc
mutation: add consume()
flat_mutation_reader: move mutation consumer concepts to separate header
mutation compactor: query compaction: ignore purgeable tombstones
This will be the only method to create sstable readers with. For now we
leave the other variants, they as well as their users will be removed in
a following patch.
This patch adds a cql-pytest, test_json.py::test_tojson_double(),
which reproduces issue #7972 - where toJson() prints some doubles
incorrectly - truncated to integers, but some it prints fine (I
still don't know why, this will need to be debugged).
The test is marked xfail: It fails on Scylla, and passes on Cassandra.
Refs #7972.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210127124338.297544-1-nyh@scylladb.com>
This patch introduces `schema_raft_state_machine` class
which is currently just a dummy implementation throwing a
"not implemented" exception for every call.
Will be needed later to construct an instance of `raft::server`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210126193413.1520948-1-pa.solodovnikov@scylladb.com>
For some reason we had a distinct specialization of the `serialize`
function to handle `bytes_ostream`, but not of `deserialize`.
This will be used in the following patches.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Currently the replacing node sets the status as STATUS_UNKNOWN when it
starts gossip service for the first time before it sets the status to
HIBERNATE to start the replacing operation. This introduces the
following race:
1) Replacing node using the same IP address of the node to be replaced
starts gossip service without setting the gossip STATUS (will be seen as
STATUS_UNKNOWN by other nodes)
2) Replacing node waits for gossip to settle and learns status and
tokens of existing nodes
3) Replacing node announces the HIBERNATE STATUS.
After Step 1 and before Step 3, existing nodes will mark the replacing
node as UP, but haven't marked the replacing node as doing replacing
yet. As a result, the replacing node will not be excluded from the read
replicas and will be considered a target node to serve CQL reads.
To fix, we make the replacing node avoid responding to echo messages when it
is not ready.
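The fix can be sketched minimally like this (hypothetical names, not the real gossiper code):

```cpp
#include <cassert>

// Minimal sketch of the fix (hypothetical names, not the real gossiper
// code): the node replies to gossip echo messages only once it is ready,
// so peers cannot mark a replacing node UP before it has announced its
// HIBERNATE status.
struct gossiper_like {
    bool ready = false;
    // Returns true iff an echo reply is sent back to the peer.
    bool handle_echo() const { return ready; }
};
```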
Fixes #7312
Closes #7714
I see a miscompile on aarch64 where a call to format("{}", uuid)
translates a function pointer to -1. When called, this crashes.
Reduce the inline threshold from 2500 to 600. This doesn't guarantee
no miscompiles but all the tests pass with this parameter.
Closes #7953
The following patches fix issues seen occasionally in debug mode.
Notes:
- In debug mode there's still the UB nullptr arithmetic warning.
* https://github.com/alecco/scylla/tree/raft-ale-tests-07h-wait-propagation:
raft: replication test: wait for log propagation
raft: replication test: move wait for log to a function
raft: replication test: remove unused member
raft: replication test: use later()
raft: testing: remove election wait time and just yield
test_cell_external_memory_usage uses with_allocator() to observe how some
types allocate memory. However, compiler reordering (observed with clang 11
on aarch64) can move the various thread-local CQL type object initialization
into the with_allocator() scope; so any managed object allocated as part of
this initialization also gets measured, and the test fails. The code movement
is legal, as far as I can tell.
Fix this by initializing the type object early; use an atomic_thread_fence
as an optimization barrier so the compiler doesn't eliminate or move
the early initialization.
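The pattern the fix describes can be sketched as follows (a hypothetical stand-in for the thread-local CQL type object, not the actual test code):

```cpp
#include <atomic>
#include <cassert>

// Sketch of the pattern the fix uses, with a hypothetical stand-in for
// the thread-local CQL type object: touch the object before the measured
// scope and issue an atomic_thread_fence so the compiler does not sink
// the lazy initialization into the with_allocator() region.
inline int& force_early_init() {
    static thread_local int type_object = 42;  // stand-in for the real type object
    std::atomic_thread_fence(std::memory_order_seq_cst);  // optimization barrier
    return type_object;
}
```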
Closes #7951
This patch adds a test for trying to set a tuple element to null with
fromJson(), which works on Cassandra but fails on Scylla. So the test
xfails on Scylla. Reproduces issue #7954.
Refs #7954.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210124082311.126300-1-nyh@scylladb.com>
"
`next_partition()` used to return void, so readers that had to call
future returning code had to work around this. Now that
`next_partition()` returns a future, we can get rid of these
workarounds.
Tests: unit(release, debug)
"
* 'next-partition-cross-shard-readers/v1' of https://github.com/denesb/scylla:
mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
mutation_reader: evictable_reader: remove next_partition() workaround
mutation_reader: shard_reader: remove next_partition() workaround
mutation_reader: foreign_reader: remove next_partition() workaround
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we return nullptr when we are not yet at the tail.
Fixes #7273
Closes #7903
* github.com:scylladb/scylla:
redis: rename _args_size/_size_left
redis: fix large message handling
There are two types of numerical parameter in redis protocol:
- *[0-9]+ defined array size
- $[0-9]+ defined string size
Currently, the array size is stored in args_count, and the string size is
stored in _arg_size / _size_left.
It's a bit hard to understand since both use the same word "arg(s)", so let's
rename the string size variables to _bytes_count / _bytes_left.
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we return nullptr when we are not yet at the tail.
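The idea of reporting "incomplete" to the caller can be sketched like this (hypothetical names, not the actual redis protocol parser):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Conceptual sketch of the fix (hypothetical names, not the actual redis
// protocol parser): when the declared bulk-string size exceeds the bytes
// currently buffered, return "no result" so the caller consumes more
// input, instead of parsing a truncated message.
std::optional<std::string_view> try_parse_bulk(std::string_view buf,
                                               std::size_t declared_size) {
    if (buf.size() < declared_size) {
        return std::nullopt;  // not at the tail of the message yet
    }
    return buf.substr(0, declared_size);
}
```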
Fixes #7273
Wait until entries propagate after adding and before changing leader
using the same code as done for partitioning.
This fixes occasional hangs in debug mode when a test switches to a
different leader without leaving enough time for full propagation.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Instead of sleep 1us use later()
Also use later to yield after sending append entries in rpc test impl.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
After support for the mixed-cluster compatibility feature
DIGEST_MULTIPARTITION_READ was dropped in 854a44ff9b,
range_slice_read_executor and never_speculating_read_executor became
identical, so remove the former for good.
Message-Id: <20210124122731.GA1122499@scylladb.com>
If possible, test the highest sstable format version,
as it's the most widely used.
If there are pre-written sstables we need to load from the
test directory from an older version, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call and remove the
now obsolete workaround for the synchronous next partition.
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call and remove the
now obsolete workaround for the synchronous next partition.
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call and remove the
now obsolete workaround for the synchronous next partition.
The fromJson() function can take a map JSON and use it to set a map column.
However, the specific example of a map<ascii, int> doesn't work in Scylla
(it does work in Cassandra). The xfailing tests in this patch demonstrate
this. Although the tests use perfectly legal ASCII, scylla fails the
fromJson() function, with a misleading error.
Refs #7949.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121233855.100640-1-nyh@scylladb.com>
Before we retire the mutation::query() code, expand the digest test to
check that the new code replacing it produces identical digest on all
possible equivalent mutations.
This is a replacement of `mutation::query()`, but with an implementation
based on the standard query result building code.
This will allow us to migrate the remaining `mutation::query()` users
off of said method, which in turn will allow us to retire it finally.
Reimplement in terms of the standard query result building code. We want
to retire the alternative query result code in `mutation::query()` and
`to_data_query_result()` is one of the main users.
We want to rewrite the above mentioned method's implementation in terms
of the standard query result building code (that of the `data_query()`
path), in order to retire the alternative query code in the mutation
class.
The `data_query()` code uses classes private to `mutation_partition.cc`
and instead of making these public, just move `to_data_query_result()`
to `mutation_partition.cc`.
This consume method accepts a `FlattenedConsumer`, the same one that the
name-sake `flat_mutation_reader::consume()` does. Indeed the main
purpose of this method is to allow using the standard query result
building stack with a mutation, the same way said stack is used with
mutation readers currently. This will allow us to replace the parallel
query result building code that currently exists in the
`mutation::query()` and friends, with the standard one.
In the next patch we will want to use these concepts in `mutation.hh`. To
avoid pulling in the entire `flat_mutation_reader.hh` just for these,
and create a circular dependency in doing so, move them to a dedicated
header instead.
This behaviour makes query result building sensitive to whether the
data was recently compacted or not; in particular, different digests will
be produced depending on whether purgeable tombstones happened to be
compacted (and thus purged) or not. This means that two replicas can
produce different digests for the same data if one has compacted some
purgeable tombstones and the other has not.
To avoid this, drop purgeable tombstones during query compaction as
well.
This method was marked with 'FIXME -- should not be public'
when it was introduced. Since then it has stopped being used
and can now be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210122083146.5886-1-xemul@scylladb.com>
`multishard_combining_reader` currently only works under the assumption
that every table uses the same sharder configured using the node's number
of shards. But we could potentially specify a different sharder for a chosen table,
e.g. one that puts everything on shard 0.
Then this assumption will be broken and the reader causes a segfault.
Fixes #7945.
When writing to an integer column, Cassandra's fromJson() function allows
not just JSON number constants, it also allows a string containing a
number. Strings which do not hold a number fail with a FunctionFailure.
In particular, the empty string "" is an invalid number, and should fail.
The tests in this patch check this for two integer types: int and
varint.
Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails
to recognize the error for varint, while Cassandra fails to recognize
the error for int. The tests in this patch reproduce these bugs.
The tests demonstrating Scylla's bug are marked xfail, and the test
demonstrating Cassandra's bug is marked "cassandra_bug" (which means
it is marked xfail only when running against Cassandra, but expected
to succeed on Scylla).
Refs #7944.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121133833.66075-1-nyh@scylladb.com>
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.
This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:
1. Parse errors in fromJson()'s parameters are converted into a
function_execution_exception.
2. Any exceptions during the execute() of a native_scalar_function_for
function is converted into a function_execution_exception.
In particular, fromJson() uses a native_scalar_function_for.
Note, however, that for functions which already took care to produce
a specific Cassandra error, that error is passed through and not
converted to a function_execution_exception. An example is
blobAsText(), which can return an invalid_request error, so
it is left as such and not converted. This also happens in Cassandra.
All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.
Fixes #7911
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
Merged patch series by Konstantin Osipov:
"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.
The test coverage is extended for all of the above."
uuid: add a comment warning against UUID::operator<
uuid: replace slow versions of timeuuid compare with optimized/tested versions.
test: add tests for legacy uuid compare & msb monotonicity
test: add a test case for append/prepend limit
test: add a test case for monotonicity of timeuuid least significant bits
uuid: implement optimized timeuuid compare
test: add a test case for list prepend/append with custom timestamp
lists: rewrite list prepend to use append machinery
lists: use query timestamp for list cell values during append
uuid: fill in UUID node identifier part of UUID
test: add a CQL test for list append/prepend operations
Introduce uint64_t based comparator for serialized timeuuids.
Respect Cassandra legacy for timeuuid compare order.
Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.
This commit re-implements these comparators and deprecates the
respective implementations in types.cc. They
will be removed in a following patch.
A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows 2-4x speed up of the new comparators.
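The core idea of a uint64_t-based comparator can be sketched like this (illustrative only, not the committed implementation): the most-significant word of a version-1 UUID stores time_low | time_mid | version+time_high, so shuffling the fields into timestamp order yields an integer whose natural ordering matches the embedded timestamp.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch, not the committed comparator: reorder the V1 UUID
// most-significant 64 bits (time_low:32 | time_mid:16 | ver:4 | time_hi:12)
// into timestamp order (time_hi | time_mid | time_low) so that a plain
// uint64_t comparison orders timeuuids by their embedded timestamp.
constexpr uint64_t timeuuid_msb_to_ordered(uint64_t msb) {
    uint64_t time_low  = msb >> 32;            // upper 32 bits of the msb
    uint64_t time_mid  = (msb >> 16) & 0xFFFF;
    uint64_t time_high = msb & 0x0FFF;         // version bits stripped
    return (time_high << 48) | (time_mid << 32) | time_low;
}
```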
Rewrite list prepend to use the same machinery
as append, and thus produce correct results when used in LWT.
After this patch, list prepend begins to honor user supplied timestamps.
If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00
an exception is thrown.
Fixes #7611
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list
the coordinator generates a timeuuid reflecting the current time as key
and adds a value to the map using this key.
Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user supplied or LWT timestamp.
This could break LWT linearizability. User supplied timestamps were
ignored.
This is reported as https://github.com/scylladb/scylla/issues/7611
A statement which appended multiple values to a list, or a BATCH,
generated a separate microsecond-resolution timeuuid for each value:
BEGIN BATCH
UPDATE ... SET a = a + [3]
UPDATE ... SET a = a + [4]
APPLY BATCH
UPDATE ... SET a = a + [3, 4]
To fix the bug, it's necessary to preserve monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as is set by LWT or user.
To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:
60 bits: time since start of GMT epoch, year 1582, represented
in 100-nanosecond units
4 bits: version
14 bits: clock sequence, a random number to avoid duplicates
in case system clock is adjusted
2 bits: type
48 bits: MAC address (or other hardware address)
The purpose of the clockseq bits, as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5,
is to reduce the probability of UUID collision in case the clock
goes back in time or the node id changes. The implementation should reset it
whenever one of these events may occur.
Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.
The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
values of a statement or a batch. The time is unique and monotonic in
case of LWT. Otherwise it's almost always monotonic, but may not be
unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
if time is auto-generated, MAC component contains a random (spoof)
MAC address, re-created on each restart. The address is different
at each shard.
The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code while not increasing the
quality (uniqueness) of the created list keys.
Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in the nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If this limit is exceeded, an exception is produced.
A twist on the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses int8 compare to compare lower
bits of timeuuid for ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.
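The sign-complement trick can be illustrated like this (a sketch, not the committed code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch, not the committed code: the legacy comparator
// compares the lower timeuuid bytes as signed int8, so an unsigned
// sequence counter is not monotonic under it. Flipping the top bit
// maps the unsigned range 0..255 onto the signed order -128..127
// monotonically, which is the sign-complement trick described above.
constexpr int8_t to_legacy_order(uint8_t counter) {
    return static_cast<int8_t>(counter ^ 0x80);
}
```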
Fixes #7611
test: unit (dev)
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6 byte node identifier was mostly
empty, i.e. only containing shard id. This could lead to
collisions between queries executed concurrently at different
coordinators, and, since timeuuid is used as key in list append
and prepend operations, lead to lost updates.
To generate a unique node id, the patch uses a combination of
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.
The shard id is mixed into the higher bits of the MAC, to reduce the
chances of NIC collision within the same network.
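The mixing idea can be sketched as follows (a hypothetical helper, not the actual Scylla code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the idea, not the actual Scylla code: mix the
// shard id into the high bits of the 48-bit node field, so shards on the
// same host (same MAC) get distinct node ids while the low MAC bits are
// preserved.
constexpr uint64_t make_node_id(uint64_t mac48, uint64_t shard) {
    return (mac48 ^ (shard << 40)) & 0xFFFFFFFFFFFFULL;
}
```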
With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but multi-value update can still be "merged"
with another multi-value update.
E.g. if node A executes SET l = l + [4, 5] and node B executes SET
l = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4,
6, 5, 7], [6, 4, 5, 7] and so on.
At least we are now less likely to get any value lost.
Fixes #6208.
@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()
test: unit (dev)
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.
Note that managed_bytes is the only user.
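The fragmentation behaviour can be sketched as follows (a hypothetical helper, not managed_bytes itself):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch (hypothetical helper, not managed_bytes itself):
// with a preferred maximum contiguous allocation of 128k, a value larger
// than that is split into full 128k fragments plus a smaller tail.
std::vector<std::size_t> fragment_sizes(std::size_t total,
                                        std::size_t max_chunk = 128 * 1024) {
    std::vector<std::size_t> out;
    while (total > max_chunk) {
        out.push_back(max_chunk);
        total -= max_chunk;
    }
    if (total) {
        out.push_back(total);
    }
    return out;
}
```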
Closes #7943
This patch set adds etcd unit tests for raft.
It also includes a fix for replication test in debug mode and a
simplification for append_request.
Tests: unit ({dev}), unit ({debug}), unit ({release})
* https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
raft: etcd unit tests: test log replication
raft: boost test etcd: test fsm can vote from any state
raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
raft: replication test: add etcd test for cycling leaders
raft: testing: provide primitives to wait for log propagation
raft: etcd unit tests: initial boost tests
raft: combine append_request _receive and _send
Podman doesn't correctly support --pids-limit with cgroupsv1. Some
versions ignore it, and some versions reject the option.
To avoid the error, don't supply --pids-limit if cgroupsv2 is not
available (detected by its presence in /proc/filesystems). The user
is required to configure the pids limit in
/etc/containers/containers.conf.
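The detection logic can be sketched with a hypothetical helper (the real check lives in the container packaging scripts, not in C++):

```cpp
#include <cassert>
#include <string_view>

// Sketch of the detection logic with a hypothetical helper (the real
// check lives in the container packaging scripts): cgroupsv2 counts as
// available when "cgroup2" appears in /proc/filesystems, and only then
// is --pids-limit passed to podman.
bool cgroup2_listed(std::string_view proc_filesystems) {
    return proc_filesystems.find("cgroup2") != std::string_view::npos;
}
```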
Fixes #7938.
Closes #7939
"
_consumer_fut is expected to return an exception
on the abort path. Wait for it and drop any exception
so it won't be abandoned as seen in #7904.
A future<> close() method was added to return
_consumer_fut. It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.
With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.
Fixes #7904
Test: unit(release)
"
* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: add close
mutation_writer/feed_writers: refactor bucket/shard writers
mutation_writer: update bucket/shard writers consume_end_of_stream
This adds a "--build-mode" command line option to "scylla" executable:
$ ./build/dev/scylla --build-mode
dev
This allows you to discover the build mode of a "scylla" executable
without resorting to "readelf", for example, to verify that you are
looking at the correct executable while debugging packaging issues.
Closes #7865
Just like scylla-sstable-index, scylla-types accepts types in (short)
cassandra class name notation. The mapping from the cql3 type names to
the class names is not straight-forward in all cases, so provide a link
to a table which lists the cassandra class name of all supported types
(and more).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.
A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.
Reopens #5857.
"
Currently storage service and snitch implicitly depend on each
other. Storage service gossips snitch data on start, snitch
kicks the storage service when its configuration changes.
This interdependency is relaxed:
- snitch gossips all its state itself without using the
storage service as a mediator
- storage service listens for snitch updates with the help
of self-breaking subscription
Both changes make snitch independent from storage service,
remove yet another call for global storage service from the
codebase and make the storage service -> snitch reference
robust against dangling pointers/references
tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev)
"
* 'br-snitch-gossip-2' of https://github.com/xemul/scylla:
storage-service: Subscribe to snitch to update topology
snitch: Introduce reconfiguration signal
snitch: Always gossip snitch info itself
snitch: Do gossip DC and RACK itself
snitch: Add generic gossiping helper
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
Current code that checks when a snapshot has to be transferred does not
take into account the case where there can be log entries preceding the
snapshot. Fix the code to correctly test for snapshot transfer
condition.
Message-Id: <20210117095801.GB733394@scylladb.com>
replication_test's state machine is not commutative, so if commands are
applied in a different order the states will be different as well. Since
the preemption check was added into co_await in seastar, even waiting for
a ready future can preempt, which will cause reordering of simultaneously
submitted entries in debug mode.
submission parallel in the test, but with the above seastar change it
is no longer possible to maintain it without changing the state machine
to be commutative. The patch changes the test to submit entries one by
one.
Message-Id: <20210117095147.GA733394@scylladb.com>
bucket_writer::close waits for the _consumer_fut.
It is called both after consume_end_of_stream()
and after abort().
_consumer_fut is expected to return an exception
on the abort path. Wait for it and drop any exception
so it won't be abandoned as seen in #7904.
With that moved to close() time, consume_end_of_stream
doesn't need to return a future and is made void
all the way up the stack. This is ok since
queue_reader_handle::push_end_of_stream is synchronous too.
Added a unit test that aborts the reader consumer
during `segregate_by_timestamp`, reproducing the
Exceptional future ignored issue without the fix.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Consolidate shard_based_splitting_writer::shard_writer
and timestamp_based_splitting_writer::bucket_writer
common code into mutation_writer::bucket_writer.
This provides a common place to handle consume_end_of_stream()
and abort(), and in particular the handling of the underlying
_consumer_fut.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After 61520a33d6,
feed_writers doesn't call consume_end_of_stream
after abort(), so there is no need to test
if (!_handle.is_terminated()) {
and consume_end_of_stream is now called in then_wrapped
rather than `finally`, so it's ok if it throws.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The code would already silence broken pipe exceptions, since they are
expected when the other side closes the connection or when we shut down the
socket during Scylla shutdown, but the code wouldn't handle the following:
1. "Connection reset by peer" errors: these can also happen in the
aforementioned two scenarios; the conditions that determine which of
the two types of errors occur are unclear.
2. The scenarios would sometimes result in a `seastar::nested_exception`,
mainly during shutdown. The errors could happen once when trying to send
a response to a request (`_write_buf.write(...)/flush(...)`) and then
again when trying to close the connection in a `finally` block. These
nested exceptions were not silenced.
The commit handles each of these cases.
Closes #7907.
Closes #7931
The main motivation for this patchset is to prepare
for adding a async close() method to flat_mutation_reader.
In order to close the reader before destroying it
in all paths we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destroying it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.
Test: unit(release, debug)
* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
flat_mutation_reader: return future from next_partition
multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
mutation_reader: filtering_reader: fill_buffer: futurize inner loop
flat_mutation_reader::impl: consumer_adapter: futurize handle_result
flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
flat_mutation_reader:impl: get rid of _consume_done member
For tests to be able to transition in a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.
This prevents occasional hangs in debug mode for incoming tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Combine structs for append request send and receive into a single
struct.
Author: Gleb Natapov <gleb@scylladb.com>
Date: Mon Nov 23 14:33:14 2020 +0200
Test single- and multi- value list append, prepend,
append and prepend in a batch, conditional statements.
This covers the parts of Cassandra which are working as documented
and which we intend to preserve compatibility with.
storage_service: Introduce load_and_stream
=== Introduction ===
This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.
For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.
This can make restores and migrations much easier.
=== Performance ===
I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB
Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.
1 TB / (40 MB/s per shard * 32 shards) ≈ 780 s ≈ 13 mins
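That arithmetic can be sanity-checked in a few lines (assuming 1 TB = 10^6 MB):

```cpp
#include <cassert>

// Back-of-the-envelope check of the load-and-stream time estimate.
constexpr long total_mb = 1'000'000;        // 1 TB of sstables per node
constexpr long mb_per_sec_per_shard = 40;   // measured throughput
constexpr long shards = 32;                 // shards per node

constexpr long seconds = total_mb / (mb_per_sec_per_shard * shards); // ~781 s
constexpr long minutes = seconds / 60;      // ~13 min
```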
=== Tests ===
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.
=== Usage ===
curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
=== Notes ===
Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node, but it will not delete it either. One
has to run nodetool cleanup to remove that data manually, which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes streaming, so no nodetool
cleanup is needed.
The name of this feature load and stream follows load and store in CPU world.
Fixes #7831
Closes #7846
* github.com:scylladb/scylla:
storage_service: Introduce load_and_stream
distributed_loader: Add get_sstables_from_upload_dir
table: Add make_streaming_reader for given sstables set
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
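The idea can be sketched minimally (the names here are illustrative stand-ins, not Scylla's actual managed_bytes API): a fragmented value is consumed either piecewise, or through an explicit, temporary linearization:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for managed_bytes: data may be split into fragments.
using fragmented_bytes = std::vector<std::string>;

// Toy with_linearized(): materializes a contiguous copy only for the
// duration of the call, mirroring the explicit-linearization accessor.
template <typename Func>
auto with_linearized(const fragmented_bytes& fb, Func&& f) {
    std::string linear;
    for (const auto& frag : fb) {
        linear += frag;
    }
    return f(linear); // the contiguous view exists only within this scope
}

// Preferred long-term style: consume fragment by fragment, so no
// large contiguous allocation is ever needed.
size_t total_size(const fragmented_bytes& fb) {
    size_t n = 0;
    for (const auto& frag : fb) {
        n += frag.size();
    }
    return n;
}
```

Callers that only need the data transiently use with_linearized(); callers that store a reference use the view type instead, which keeps the door open to removing linearization entirely.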
Closes #7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() earlier
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
This test is a sanity check. It verifies that our wrappers over well known
hashes (xxhash, md5, sha256) actually calculate exactly those hashes.
It also checks that the `update()` methods of used hashers are linear with
respect to concatenation: that is, `update(a + b)` must be equivalent to
`update(a); update(b)`. This wasn't relied on before, but now we need to
confirm that hashing fragmented keys without linearizing them won't break
backward compatibility.
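The concatenation-linearity property can be illustrated with a toy incremental hasher (FNV-1a here, purely for illustration; the real test covers the xxhash, md5 and sha256 wrappers):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy incremental hasher: FNV-1a processes input byte by byte, so
// update(a + b) is by construction equivalent to update(a); update(b).
struct fnv1a_hasher {
    uint64_t state = 14695981039346656037ULL; // FNV-1a offset basis
    void update(const std::string& data) {
        for (unsigned char c : data) {
            state ^= c;
            state *= 1099511628211ULL; // FNV-1a 64-bit prime
        }
    }
    uint64_t finalize() const { return state; }
};
```

A hasher with this property can consume a fragmented key one fragment at a time and still produce the same digest as hashing the linearized key.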
managed_bytes has a small overhead per each fragment. Due to that, managed_bytes
containing the same data can have different total memory usage in different
allocators. The smaller the preferred max allocation size setting is, the more
fragments are needed and the greater total per-fragment overhead is.
In particular, managed_bytes allocated in the LSA could grow in
memory usage when copied to the standard allocator, if the standard allocator
had a preferred max allocation setting smaller than the LSA.
partition_snapshot_accounter calculates the amount of memory used by
mutation fragments in the memtable (where they are allocated with LSA) based
on the memory usage after they are copied to the standard allocator.
This could result in an overestimation, as explained above.
But partition_snapshot_accounter must not overestimate the amount of freed
memory, as doing otherwise might result in OOM situations.
This patch prevents the overaccounting by adding minimal_external_memory_usage():
a new version of external_memory_usage(), which ignores allocator-dependent
overhead. In particular, it includes the per-fragment overhead in managed_bytes
only once, no matter how many fragments there are.
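The accounting idea can be sketched as follows (hypothetical names and a made-up fixed overhead; the real code lives in managed_bytes): count the per-fragment overhead only once, regardless of how an allocator fragmented the value:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy fragmented value with a fixed bookkeeping cost per fragment.
constexpr size_t per_fragment_overhead = 16;

using fragments = std::vector<std::string>;

// Allocator-dependent usage: more fragments mean more total overhead.
size_t external_memory_usage(const fragments& f) {
    size_t n = 0;
    for (const auto& frag : f) {
        n += frag.size() + per_fragment_overhead;
    }
    return n;
}

// Allocator-independent lower bound: the overhead is counted once, so a
// value copied between allocators with different fragment sizes can never
// be accounted as freeing more memory than it actually occupies.
size_t minimal_external_memory_usage(const fragments& f) {
    size_t data = 0;
    for (const auto& frag : f) {
        data += frag.size();
    }
    return data + per_fragment_overhead;
}
```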
The comments were outdated after the latest changes (bytes_view vs
managed_bytes_view).
compound_view_wrapper::get_component() is unused, so we remove it.
Since we introduced the relocatable package and offline installer, the scylla binary itself can run on almost any distribution.
However, the setup scripts are not designed to run on unsupported distributions, which causes errors on such environments.
This PR adds minimal support for running the offline installation on unsupported distributions, tested on SLES, Arch Linux and Gentoo.
Closes #7858
* github.com:scylladb/scylla:
dist: use sysconfig_parser to parse gentoo config file
dist: add package name translation
dist: support SLES/OpenSUSE
install.sh: add systemd existence check
install.sh: ignore errors on missing sysctl entries
dist: show warning on unsupported distributions
dist: drop Ubuntu 14.04 code
dist: move back is_amzn2() to scylla_util.py
dist: rename is_gentoo_variant() to is_gentoo()
dist: support Arch Linux
dist: make sysconfig directory detectable
Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer()
generates mutation fragments until the buffer is full [1] without yielding.
This is the code path:
flush_reader::fill_buffer() <---------|
flat_mutation_reader::consume_pausable() <--------|
partition_snapshot_flat_reader::fill_buffer() -|
[1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261
This is fixed by breaking out of the loop in do_fill_buffer() if preemption
is needed, allowing do_until() to yield in between; when it resumes, it
continues from where it left off, until the buffer is full.
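The shape of the fix can be sketched like this (need_preempt() below is a toy stand-in for Seastar's preemption check, and the buffer is just a vector):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy preemption check: pretend the scheduler asks us to yield every
// 4th iteration (Seastar's real need_preempt() reads a timer-driven flag).
static int preempt_counter = 0;
bool need_preempt() { return ++preempt_counter % 4 == 0; }

// do_fill_buffer: emit fragments until the buffer is full *or* preemption
// is needed.  The caller's loop re-enters and we continue from where we
// left off, so the buffer still fills up eventually, without a long stall.
void do_fill_buffer(std::vector<int>& buffer, int& next, size_t full_size) {
    while (buffer.size() < full_size) {
        buffer.push_back(next++);
        if (need_preempt()) {
            return; // yield to the reactor; resume on the next call
        }
    }
}
```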
Fixes #7885.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>
implicit revert of 6322293263
sshd was previously used by scylla manager 1.0.
The new version does not need it, so there is no point in
keeping it around. It also confuses everyone.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Closes #7921
The client_state::check_access() calls the global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.
However it was noticed that the feature check is excessive,
as all the guarded if-s will resolve to false in case CDC is off,
so check_access effectively works the same without the explicit
feature check.
With that observation, it's possible to ditch one more global storage
service reference.
tests: unit(dev), dtest(dev, auth)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
The reader_meta in _readers[shard] is created on shard 0 and must
be destroyed on it as well.
A following patch changes next_partition() to return a future<>
thus it introduces a continuation that requires access to `rm`.
We cannot move it down to the continuation safely, since it will be
wrongly destroyed in the invoked shard, so use do_with to hold it
in the scope of the calling shard until the invoked function
completes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently snitch explicitly calls storage service (if
it's initialized) to update topology on snitch data
change.
Instead of it -- make storage service subscribe on the
snitch reconfigure signal upon creation.
This finally makes snitch fully independent from storage
service.
In tests the snitch instance is not created, so check
for it before subscribing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a notifier to snitch_base that gets triggered when the
snitch configuration changes to which others may subscribe.
For now only the gossiping-file-snitch triggers it when it
re-reads its config file. Other existing snitches are kinda
static in this sense.
The subscribe-trigger engine is based on scoped connection
from boost::signals2.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The gossiping_property_file_snitch updates the gossip RACK and DC
values upon config change. Right now this is done with the help
of storage service, but the needed code to gossip rack and dc is
already available in the snitch itself.
That said -- gossip the snitch info by the snitch helper and remove the
storage_service's one. This makes the 2nd step decoupling snitch
and storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the 2nd step in generalizing the snitch data gossiping
and at the same time the 1st step in decoupling storage service
and snitch.
During start storage service starts gossiper, which notifies the
snitch with the .gossiper_starting() call, then the storage service
calls gossip_snitch_info.
This patch makes snitch itself do the last step.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays some snitch implementations gossip the INTERNAL_IP
value and storage_service gossips RACK and DC for all of them.
This functionality is going to be generalized and the first
step is in making a common method for a snitch to gossip its
data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The offline installer can run on non-systemd distributions, but it won't
work since we only have systemd units.
So check for systemd existence and print an error message.
is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL,
and is_debian_variant() is the function to detect Debian/Ubuntu.
Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants",
so we should rename it to is_gentoo().
Currently, install.sh provides a way to customize the sysconfig directory,
but the sysconfig directory is hardcoded in the script.
Also, /etc/sysconfig seems to be the correct default value, but the current
code specifies /etc/default for non-redhat distributions.
Instead of hardcoding, generate a Python script in install.sh
to save the specified sysconfig directory path in Python code.
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.
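The pattern is roughly the following (simplified sketch: Seastar's HTTP server takes a body-writer function over an output_stream, approximated here with std::ostream):

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Instead of building the whole reply in one big string (one large
// contiguous allocation), stream one small chunk per column family
// directly into the output stream.
void write_body(std::ostream& out, const std::vector<std::string>& column_families) {
    out << "[";
    bool first = true;
    for (const auto& cf : column_families) {
        if (!first) {
            out << ",";
        }
        first = false;
        out << "{\"name\":\"" << cf << "\"}"; // one small write per entry
    }
    out << "]";
}
```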
Fixes #7916
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #7917
After these changes the generated code deserializes the stream into a chunked vector, instead of a contiguous one, so even if there are many fields in it, there won't be any big allocations.
I haven't run the scylla cluster test with it yet but it passes the unit tests.
Closes #7919
* github.com:scylladb/scylla:
idl: change the type of mutation_partition_view::rows() to a chunked_vector
idl-compiler: allow fields of type utils::chunked_vector
Numbers in JSON are not limited in range, so when the fromJson() function
converts a number to a limited-range integer column in Scylla, this
conversion can overflow. The following tests check that this conversion
should result in an error (FunctionFailure), not silent truncation.
Scylla today does silently wrap around the number, so these tests
xfail. They pass on Cassandra.
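The expected behaviour amounts to a checked narrowing conversion; a sketch (function_failure here is a stand-in name for the CQL FunctionFailure error):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Stand-in for the CQL FunctionFailure error.
struct function_failure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Convert a JSON number (arbitrary range, held here as int64) to a
// 32-bit "int" column value, failing instead of silently wrapping around.
int32_t json_to_int32(int64_t json_number) {
    if (json_number < std::numeric_limits<int32_t>::min() ||
        json_number > std::numeric_limits<int32_t>::max()) {
        throw function_failure("fromJson: number out of range for type int");
    }
    return static_cast<int32_t>(json_number);
}
```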
Refs #7914.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>
This patch adds more (failing) tests for issue #7911, where fromJson()
failures should be reported as a clean FunctionFailure error, not an
internal server error.
The previous tests we had were about JSON parse failures, but a
different type of error we should support is valid JSON which returned
the wrong type - e.g., the JSON returning a string when an integer
was expected, or the JSON returning a string with non-ASCII characters
when ASCII was expected. So this patch adds more such tests. All of
them xfail on Scylla, and pass on Cassandra.
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>
This patch adds a reproducer test for issue #7912, which is about passing
a null parameter to the fromJson() function supposed to be legal (and
return a null value), and is legal in Cassandra, but isn't allowed in
Scylla.
There are two tests - for a prepared and unprepared statement - which
fail in different ways. The issue is still open so the tests xfail on
Scylla - and pass on Cassandra.
Refs #7912.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>
Related issue scylladb/sphinx-scylladb-theme#88
Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com
Frequently asked questions:
Should we change the links in the README/docs folder?
GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html
Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections.
Do I need to add this new domain in the custom dns domain section on GitHub settings?
It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged.
The DNS doesn't seem to have the right SSL certificates
GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert.
Closes #7877
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
The min/max aggregators use aggregate_type_for comparators, and the
aggregate_type_for<timeuuid> is regular uuid. But that yields wrong
results; timeuuids should be compared as timestamps.
Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid,
so aggregators can distinguish between the two. Then specialize the
aggregation utilities for timeuuid.
Add a cql-pytest and change some unit tests, which relied on naive
uuid comparators.
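The heart of the fix is ordering timeuuids by their embedded timestamp rather than by plain numeric/uuid order. A sketch of the timestamp extraction, based on the RFC 4122 version-1 field layout (illustrative, not Scylla's exact comparator):

```cpp
#include <cassert>
#include <cstdint>

// Extract the 60-bit timestamp from the most-significant 64 bits of a
// version-1 (time-based) UUID.  The MSB half is laid out as
// time_low (32) | time_mid (16) | version (4) | time_hi (12),
// while the actual timestamp is time_hi << 48 | time_mid << 32 | time_low.
uint64_t timeuuid_timestamp(uint64_t msb) {
    uint64_t time_low = msb >> 32;
    uint64_t time_mid = (msb >> 16) & 0xffff;
    uint64_t time_hi = msb & 0x0fff; // strip the 4 version bits
    return (time_hi << 48) | (time_mid << 32) | time_low;
}

// Compare two timeuuids by timestamp, the way min()/max() should.
int timeuuid_tricompare(uint64_t msb_a, uint64_t msb_b) {
    uint64_t ta = timeuuid_timestamp(msb_a);
    uint64_t tb = timeuuid_timestamp(msb_b);
    return ta < tb ? -1 : ta > tb ? 1 : 0;
}
```

The test below shows why naive uuid comparison is wrong: a later timestamp with a large time_low field compares smaller numerically.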
Fixes #7729.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7910
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.
Fixes #4617.
tests:
- mode(dev).
- manually tested it by forcing a flush of memtable spanning many windows
"
* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
test: Add test for TWCS interposer on memtable flush
table: Wire interposer consumer for memtable flush
table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
table: Allow sstable write permit to be shared across monitors
memtable: Track min timestamp
table: Extend cache update to operate a memtable split into multiple sstables
This patch adds a reproducer test for issue #7911, which is about a parse
error in JSON string passed to the fromJson() function causing an
internal error instead of the expected FunctionFailure error.
The issue is still open so the test xfails on Scylla (and passes on
Cassandra).
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>
The option can only take integer values >= 0, since negative
TTL is meaningless and is expected to fail the query when used
with `USING TTL` clause.
It's better to fail early on `CREATE TABLE` and `ALTER TABLE`
statement with a descriptive message rather than catch the
error during the first lwt `INSERT` or `UPDATE` while trying
to insert to system.paxos table with the desired TTL.
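A minimal sketch of the early validation (the function and error names here are illustrative, not Scylla's exact code):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Validate the table option holding the TTL for system.paxos writes at
// CREATE/ALTER TABLE time, instead of failing the first LWT statement.
void validate_paxos_ttl_option(int64_t value) {
    if (value < 0) {
        throw std::invalid_argument(
            "option value must be an integer greater than or equal to 0");
    }
}
```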
Tests: unit(dev)
Fixes: #7906
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>
Unfortunately snapshot checking still does not work in the presence of
log entries reordering. It is impossible to know when exactly the
snapshot will be taken and if it is taken before all smaller than
snapshot idx entries are applied the check will fail since it assumes
that.
This patch disables snapshot checking for the SUM state machine that is used
in backpressure test.
Message-Id: <20201126122349.GE1655743@scylladb.com>
The value of mutation_partition_view::rows() may be very large, but is
used almost exclusively for iteration, so in order to avoid a big allocation
for an std::vector, we change its type to an utils::chunked_vector.
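The idea behind utils::chunked_vector, sketched minimally (illustrative, not Scylla's implementation): storage is a sequence of fixed-size chunks, so growth never needs one huge contiguous allocation:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Minimal chunked vector: elements live in fixed-size chunks, so even a
// very large number of rows never requires one big contiguous allocation.
template <typename T, size_t ChunkSize = 1024>
class chunked_vector {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    size_t _size = 0;
public:
    void push_back(T value) {
        if (_size % ChunkSize == 0) {
            auto chunk = std::make_unique<std::vector<T>>();
            chunk->reserve(ChunkSize); // each chunk is a small allocation
            _chunks.push_back(std::move(chunk));
        }
        _chunks.back()->push_back(std::move(value));
        ++_size;
    }
    T& operator[](size_t i) { return (*_chunks[i / ChunkSize])[i % ChunkSize]; }
    size_t size() const { return _size; }
};
```

Because the interface mirrors std::vector (push_back, operator[], size), the IDL-generated deserialization code can target it with no other changes.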
Fixes #7918
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The utils::chunked_vector has practically the same methods
as a std::vector, so the same code can be generated for it.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
We have recently seen a suspected corrupt mutation fragment stream get
into an sstable undetected, causing permanent corruption. One of the
suspected ways this could happen is the compaction sstable write path not
being covered with a validator. To prevent events like this in the future
make sure all sstable write paths are validated by embedding the validator
right into the sstable writer itself.
Refs: #7623
Refs: #7640
Tests: unit(release)
* https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2:
sstable_writer: add validation
test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
mutation_fragment_stream_validator: make it easier to validate concrete fragment types
flat_mutation_reader: extract fragment stream validator into its own header
Attempt to hurry flushing/segment delete/recycle if we are trying
to get a segment for allocation and the reserve is empty when above
the disk threshold. This is to minimize the time waited in the
allocation semaphore.
Cassandra constructs `QueryOptions.SpecificOptions` in the same
way that we do (by not providing `serial_consistency`), but they
do have a user-defined constructor which does the following thing:
this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency;
This effectively means that DEFAULT `SpecificOptions` always
have `SerialConsistency` set to `SERIAL`, while we leave this
`std::nullopt`, since we don't have a constructor for
`specific_options` which does this.
Supply `db::consistency_level::SERIAL` explicitly to the
`specific_options::DEFAULT` value.
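In effect, the fix is to default the optional serial consistency to SERIAL the way Cassandra's constructor does; a sketch:

```cpp
#include <cassert>
#include <optional>

enum class consistency_level { ONE, QUORUM, SERIAL, LOCAL_SERIAL };

// Mirror of Cassandra's SpecificOptions constructor logic: a missing
// serial consistency means SERIAL, never "unset".
consistency_level effective_serial_consistency(
        std::optional<consistency_level> requested) {
    return requested.value_or(consistency_level::SERIAL);
}
```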
Tests: unit(dev)
Fixes: #7850
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>
This adds a simple reproducer for a bug involving a CONTAINS relation on
frozen collection clustering columns when the query is restricted to a
single partition - resulting in a strange "marshalling error".
This bug still exists, so the test is marked xfail.
Refs #7888.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>
We add a reproducer for issues #7868 and #7875 which are about bugs when
a table has a frozen collection as its clustering key, and it is sorted
in *reverse order*: If we tried to insert an item to such a table using an
unprepared statement, it failed with a wrong error ("invalid set literal"),
but if we try to set up a prepared statement, the result is even worse -
an assertion failure and a crash.
Interestingly, neither of these problems happen without reversed sort order
(WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which
demonstrates that with default (increasing) order, everything works fine.
All tests pass successfully when run against Cassandra.
The fix for both issues was already committed, so I verified these tests
reproduced the bug before that commit, and pass now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>
In this patch, we port validation/entities/frozen_collections_test.java,
containing 33 tests for frozen collections of all types, including
nesting collections.
In porting these tests, I uncovered four previously unknown bugs in Scylla:
Refs #7852: Inserting a row with a null key column should be forbidden.
Refs #7868: Assertion failure (crash) when clustering key is a frozen
collection and reverse order.
Refs #7888: Certain combination of filtering, index, and frozen collection,
causes "marshalling error" failure.
Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections.
These tests also provide two more reproducers for an already known bug:
Refs #7745: Length of map keys and set items are incorrectly limited to
64K in unprepared CQL.
Due to these bugs, 7 out of the 33 tests here currently xfail. We actually
had more failing tests, but we fixed issue #7868 before this patch went in,
so its tests are passing at the time of this submission.
As usual in these sort of tests, all 33 pass when running against Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>
In test_streams.py we had some code to get a list of shards and iterators
duplicated three times. Put it in a function, shards_and_latest_iterators(),
to reduce this duplication.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006112421.426096-1-nyh@scylladb.com>
Add a mutation_fragment_stream_validating_filter to
sstables::writer_impl and use it in sstable_writer to validate the
fragment stream passed down to the writer implementation. This ensures
that all fragment streams written to disk are validated, and we don't
have to worry about validating each source separately.
The current validator from sstable::write_components() is removed. This
covers only part of the write paths. Ad-hoc validations in the reader
implementations are removed as well as they are now redundant.
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
Replace two methods for unreversal (`as` and `self_or_reversed`) with
a new one (`without_reversed`). More flexible and better named.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7889
Currently, frozen mutations, that contain partitions with out-of-order
or duplicate rows will trigger (if they even do) an assert in
`row::append_cell()`. However, this results in poor diagnostics (if at
all) as the context doesn't contain enough information on what exactly
went wrong. This results in a cryptic error message and an investigation
that can only start after looking at a coredump.
This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already existed (is a
duplicate) or it is not the last row in the partition (out-of-order row)
an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.
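The check can be sketched as follows (illustrative: clustering keys are modeled as ints, and the real code raises a richer error carrying the partition context):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Rows in a partition must arrive strictly sorted by clustering key.
// Reject duplicates and out-of-order rows as early as possible, with a
// descriptive error instead of an assert deep inside row::append_cell().
void append_row(std::vector<int>& rows, int clustering_key) {
    if (!rows.empty() && clustering_key <= rows.back()) {
        throw std::runtime_error(
            clustering_key == rows.back()
                ? "duplicate row in partition"
                : "out-of-order row in partition");
    }
    rows.push_back(clustering_key);
}
```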
Tests: unit(release)
* botond/frozen-mutation-bad-row-diagnostics/v3:
frozen_mutation: add partition context to errors coming from deserializing
partition_builder: accept_row(): use append_clustering_row()
mutation_partition: add append_clustered_row()
measuring_allocator is a wrapper around standard_allocator, but it exposed
the default preferred_max_contiguous_allocation, not the one from
standard_allocator. Thus managed_bytes allocated in those two allocators
had fragments of different size, and their total memory usage differed,
causing test_external_memory_usage to fail if
standard_allocator::preferred_max_contiguous_allocation was changed from the
default. Fix that.
Remove the following bits of `managed_bytes` since they are unused:
* `with_linearized_managed_bytes` function template
* `linearization_context_guard` RAII wrapper class for managing
`linearization_context` instances.
* `do_linearize` function
* `linearization_context` class
Since there are no more public or private methods in `managed_bytes`
to linearize the value except for the explicit `with_linearized()`,
which doesn't use any of the aforementioned parts, we can safely remove
these.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There is no point in calling the wrapper since the linearization code
is private in the `managed_bytes` class and there is nothing left to call
`managed_bytes::data`, because it was deleted recently.
This patch is a prerequisite for removing the
`with_linearized_managed_bytes` function completely, along with
the corresponding parts of its implementation in `managed_bytes`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `managed_bytes::data()` is deleted as well as other public
APIs of `managed_bytes` which would linearize stored values except
for explicit `with_linearized`, there is no point
invoking `with_linearized_managed_bytes` hack which would trigger
automatic linearization under the hood of managed_bytes.
Remove useless `with_linearized_managed_bytes` wrapper from
memtable and row_cache code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.
Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.
The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.
The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.
operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
bytes_view can convert to managed_bytes_view, so the change
is compatible with the existing representation and the next
patches, which change compound types to use managed_bytes_view.
This operator has a single purpose: an easier port of legacy_compound_view
from bytes_view to managed_bytes_view.
It is inefficient and should be removed as soon as legacy_compound_view stops
using operator[].
managed_bytes_view is a non-owning view into managed_bytes.
It can also be implicitly constructed from bytes_view.
It conforms to the FragmentedView concept and is mainly used through that
interface.
It will be used as a replacement for bytes_view occurrences currently
obtained by linearizing managed_bytes.
This is a preparation for the upcoming introduction of managed_bytes_view,
intended as a fragmented replacement for bytes_view.
To ease the transition, we want both types to give equal hashes for equal
contents.
Unset values for key and value were not handled. Handle them in a
manner matching Cassandra.
This fixes all cases in testMapWithUnsetValues, so re-enable it (and
fix a comment typo in it).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the right-hand side of IN is an unset value, we must report an
error, like Cassandra does.
This fixes testListWithUnsetValues, so re-enable it.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Make the bind() operation of the scalar marker handle the unset-value
case (which it previously didn't).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Avoid crash described in #7740 by ignoring the update when the
element-to-remove is UNSET_VALUE.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The Redis protocol parser did not handle parse errors, so the reply
message was broken when a parse error occurred.
This implements parse-error handling, so an error message is now
replied correctly.
Fixes #7861 Fixes #7114 Closes #7862
When the clustering order is reversed on a map column, the column type
is reversed_type_impl, not map_type_impl. Therefore, we have to check
for both reversed type and map type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_map pass. However, it leaves
other places (invocations of is_map() and *_cast<map_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a list column, the column type
is reversed_type_impl, not list_type_impl. Therefore, we have to check
for both reversed type and list type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_list pass. However, it leaves
other places (invocations of is_list() and *_cast<list_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a set column, the column type
is reversed_type_impl, not set_type_impl. Therefore, we have to check
for both reversed type and set type in some places.
To make such checks easier, add convenience methods self_or_reversed()
and as() to abstract_type. Invoke those methods (instead of is_set()
and casts) enough to make test_clustering_key_reverse_frozen_set pass.
Leave other invocations of is_set() and *_cast<set_type_impl>() as
they are; some are protected by callers from being invoked on reverse
types, but some are quite possibly bugs untriggered by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
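The unwrapping idea behind self_or_reversed() can be illustrated with a toy model (this is not Scylla's actual type hierarchy, just a sketch of the problem the patch solves):

```python
# Illustrative model of the reversed-type problem: with reversed clustering
# order the column's type is a wrapper, so a naive "is this a set?" check
# fails unless we first unwrap the reversal, as self_or_reversed() does.
class AbstractType:
    def self_or_reversed(self):
        # For non-reversed types, the underlying type is the type itself.
        return self

    def is_set(self):
        return isinstance(self.self_or_reversed(), SetType)

class SetType(AbstractType):
    pass

class ReversedType(AbstractType):
    def __init__(self, underlying):
        self.underlying = underlying

    def self_or_reversed(self):
        return self.underlying

reversed_set = ReversedType(SetType())
assert not isinstance(reversed_set, SetType)   # the naive check misses it
assert reversed_set.is_set()                   # the unwrapping check finds it
```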
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.
This is to (in future) support incremental CL flush.
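The flush-skip logic described above might look roughly like this sketch; the class and field names here are hypothetical, not Scylla's actual code:

```python
# Hypothetical sketch: the table remembers the replay position (RP) up to
# which it has already flushed, and a new flush request at or below that
# mark becomes a no-op.
class Table:
    def __init__(self):
        self.flushed_rp = 0  # highest replay position already flushed

    def flush(self, request_rp):
        """Return True if an actual flush was performed."""
        if request_rp <= self.flushed_rp:
            return False  # everything up to request_rp is already on disk
        # ... perform the actual memtable flush here ...
        self.flushed_rp = request_rp
        return True
```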
This patch enables SELECT CQL statements that select collection columns
in queries where a clustering column is restricted by the
"IN" CQL operator. Such queries have been accepted by Cassandra since v4.0.
The internals actually provide correct support for this feature already,
so this patch simply removes the relevant CQL query check.
Tests: cql-pytest (testInRestrictionWithCollection)
Fixes #7743 Fixes #4251
Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
* seastar d1b5d41b...a2fc9d72 (6):
> perftune.py: support passing multiple --nic options to tune multiple interfaces at once
> perftune.py recognize and sort IRQs for Mellanox NICs
> perftune.py: refactor getting of driver name into __get_driver_name()
Fixes #6266
> install-dependencies: support Manjaro
> append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size
> future: do_until: handle exception in stop condition
"
The size_estimates_mutation_reader called the global proxy
to get the database from. The database is used to find the keyspaces
to work with. However, it's safe to keep a local database
reference on the reader itself.
tests: unit(debug)
"
* 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla:
size_estimate_reader: Use local db reference not global
size_estimate_reader: Keep database reference on mutation reader
size_estimate_reader: Keep database reference on virtual_reader
Conversions from views to owners have no business being implicit.
Besides, they would also cause various ambiguity problems when adding
managed_bytes_view.
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has an effect only for TWCS users, as TWCS is the only strategy that
goes on to implement this interposer consumer, which consists of
segregating data according to the window configuration.
Fixes #4617.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In this patch, we port validation/entities/collection_test.java, containing
7 tests for CQL counters. Happily, these tests did not uncover any bugs in
Scylla and all pass on both Cassandra and Scylla.
There is one small difference that I decided to ignore instead of reporting
a bug. If you try a CREATE TABLE with both counter and non-counter columns,
Scylla gives a ConfigurationException error, while Cassandra gives a more
reasonable InvalidRequest. The ported test currently allows both.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>
In issue #7843, questions were raised about how much Scylla supports
the notion of Unicode Equivalence, a.k.a. Unicode normalization.
Consider the Spanish letter ñ - it can be represented by a single Unicode
character 00F1, but can also be represented as a 006E (lowercase "n")
followed by a 0303 ("combining tilde"). Unicode specifies that these
two representations should be considered "equivalent" for purposes of
sorting or searching. But the following tests demonstrate that this
is not, in fact, supported in Scylla or Cassandra:
1. If you use one representation as the key, then looking up the other one
will not find the row. Scylla (and Cassandra) do *not* consider
the two strings equivalent.
2. The LIKE operator (a Scylla-only extension) doesn't know that
the single-character ñ begins with an n, or that the two-character
ñ is just a single character.
This is despite the thinking on #7843 that, by using ICU in the
implementation of LIKE, we somehow got support for this. We didn't.
Refs #7843
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>
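The two representations of ñ described above can be demonstrated with Python's standard unicodedata module:

```python
import unicodedata

# The two representations of the Spanish letter ñ:
single = "\u00f1"       # precomposed LATIN SMALL LETTER N WITH TILDE
combining = "n\u0303"   # 'n' followed by COMBINING TILDE

# As raw codepoint sequences they differ, which is why a key lookup with
# one form does not find a row stored under the other:
assert single != combining

# Unicode normalization (NFC here) maps both to the same form; neither
# Scylla nor Cassandra applies such normalization before comparing keys.
assert unicodedata.normalize("NFC", single) == \
       unicodedata.normalize("NFC", combining)
```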
This patch adds a reproducer for issue #7856, which is about frozen sets
and how we can in Scylla (but not in Cassandra), insert one in the "wrong"
order, but only in very specific circumstances which this reproducer
demonstrates: The bug can only be reproduced in a nested frozen collection,
and using prepared statements.
Refs #7856
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>
Tracking both min and max timestamp will be required for memtable flush
to short-circuit interposer consumer if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
- the mystical `accept_predicate` is renamed to `accept_keyspace`
to be more self-descriptive
- a short comment is added to the original calculate_schema_digest
function header, mentioning that it computes schema digest
for non-system keyspaces
Refs #7854
Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>
In scylla-jmx, we fixed a hardcoded sysconfdir in the EnvironmentFile path;
realpath was used to convert the path. This patch changes the scylla repo
to use realpath as well, making it consistent with scylla-jmx.
Suggested-by: Pekka Enberg <penberg@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes #7860
The original idea for `schema_change_test` was to ensure that **if** schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed.
Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but the effect is not worth the effort - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain.
To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test.
Tests:
* unit(dev)
* manual(rebasing on top of a change which adds two distributed system tables - all tests still passed)
Refs #7617 Closes #7854
* github.com:scylladb/scylla:
schema_change_test: skip distributed system tables in digest
schema_tables: allow custom predicates in schema digest calc
alternator: drop unneeded sstring creation
system_keyspace: migrate helper functions to string_view
database: migrate find_keyspace to string views
With previous design of the schema change test, a regeneration
was necessary each time a new distributed system table was added.
It was not the original purpose of the test to keep track of new
distributed tables which simply propagate on their own,
so the test case is now modified: internal distributed tables
are not part of the schema digest anymore, which means that
changes inside them will not cause mismatches.
This change involves a one-shot regeneration of all digests,
which due to historical reasons included internal distributed
tables in the digest, but no further regenerations should ever
be necessary when a new internal distributed table is added.
For testing purposes it would be useful to be able to skip computing
schema for certain tables (namely, internal distributed tables).
In order to allow that, a function which accepts a custom predicate
is added.
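The predicate-based digest calculation described above can be sketched as follows; the names (`schema_digest`, `accept_keyspace`) are illustrative, not Scylla's actual API:

```python
import hashlib

# Sketch: only keyspaces accepted by the predicate contribute to the
# digest, so internal distributed tables can be skipped in tests.
def schema_digest(schemas, accept_keyspace):
    h = hashlib.md5()
    for keyspace, schema_text in sorted(schemas.items()):
        if accept_keyspace(keyspace):
            h.update(schema_text.encode())
    return h.hexdigest()

schemas = {"ks1": "CREATE TABLE t ...",
           "system_distributed": "CREATE TABLE cdc ..."}
user_only = schema_digest(schemas, lambda ks: not ks.startswith("system"))

# Changing an internal table no longer changes the filtered digest:
schemas["system_distributed"] = "CREATE TABLE cdc v2 ..."
assert user_only == schema_digest(schemas,
                                  lambda ks: not ks.startswith("system"))
```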
It's now possible to use string views to check if a particular
table is a system table, so it's no longer needed to explicitly
create an sstring instance.
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speed up tests by not needing to start the gossiper.
The thing is, we always start the gossiper in our cql tests, so the flag only
introduces noise. And, of course, since we want to move schema to use raft,
it goes against the nature of raft to be able to apply modifications only
locally, so we had better get rid of the capability ASAP.
Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
When a node notices that it uses legacy SI tables it converts them to the
new format, but it updates only the local schema. This only causes schema
discrepancy between nodes, whereas the schema change should propagate
globally.
Fixes #7857.
Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>
After 13fa2bec4c, every compaction will be performed through a filtering
reader because consumers cannot do the filtering if interposer consumer is
enabled.
It turns out that filtering_reader adds significant overhead when regular
compactions are running. As no other compaction type needs to actually do
any filtering, let's limit filtering_reader to cleanup compaction.
Alternatively, we could disable the interposer consumer for cleanup,
or add support for the consumers to do the filtering themselves, but that
would add lots of complexity.
Fixes #7748.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>
This filter is used to discard data that doesn't belong to the current
shard, but scylla will only make an sstable available to regular
compaction after it was resharded on either boot or refresh.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>
This reverts commit ceb67e7728. The
"epel-release" package is needed to install the "supervisord"
package, which I somehow missed in testing...
Fixes #7851
This patch adds two simple tests for what happens when a user tries to
insert a row with one of the key columns missing. The first test confirms
that if the column is completely missing, we correctly print an error
(this was issue #3665, which was already marked fixed).
However, the second test demonstrates that we still have a bug when
the key column appears in the command, but with a null value.
In this case, instead of failing the insert (as Cassandra does),
we silently ignore it. This is the proper behavior for UNSET_VALUE,
but not for null. So the second test is marked xfail, and I opened
issue #7852 about it.
Refs #3665
Refs #7852
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>
In a previous version of test_using_timeout.py, we had tables pre-filled
with some content labeled "everything". The current version of the tests
doesn't use it, so drop it completely.
One test, test_per_query_timeout_large_enough, still had code that did
res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))
assert res == everything
This was a bug - it only works as expected if this test is run before
any other test is run, and will fail if we ever reorder or parallelize
these tests. So drop these two lines.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>
This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine.
NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla.
Closes #7453
* github.com:scylladb/scylla:
alternator: coroutinize untagging a resource
alternator: coroutinize tagging a resource
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Closes #7832
* github.com:scylladb/scylla:
scylla_setup: cleanup if judgments
scylla_setup: enable node_exporter for offline installation
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
Fixes #7758.
Closes #7842
* github.com:scylladb/scylla:
table: Fix potential reactor stall on LCS compaction completion
table: decouple preparation from execution when updating sstable set
table: change rebuild_sstable_list to return new sstable set
row_cache: allow external updater to decouple preparation from execution
The range_tombstone_list always (unless misused?) contains de-overlapped
entries. There's a test_add_random that checks this, but it suffers from
several problems:
- generated "random" ranges are sequential and may only overlap on
their borders
- the test uses keys of the same prefix length
Enhance the generator part to produce a purely random sequence of ranges
with bound keys of arbitrary length. Just pay attention to generate the
"valid" individual ranges, whose start is not ahead of the end.
Also -- rename the test to reflect what it's doing and increase the
number of iterations.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201228115525.20327-1-xemul@scylladb.com>
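The generator improvement can be sketched like this (illustrative Python, not the actual Boost test code):

```python
import random

# Sketch: purely random ranges with bound keys of arbitrary length,
# swapping the endpoints when needed so that each individual range is
# valid (start not ahead of end).
def random_range(rng, max_key_len=4, max_component=10):
    start = tuple(rng.randrange(max_component)
                  for _ in range(rng.randint(1, max_key_len)))
    end = tuple(rng.randrange(max_component)
                for _ in range(rng.randint(1, max_key_len)))
    # Lexicographic tuple comparison loosely mirrors clustering-prefix order.
    return (start, end) if start <= end else (end, start)

rng = random.Random(42)
ranges = [random_range(rng) for _ in range(1000)]
assert all(start <= end for start, end in ranges)
# Key lengths vary, unlike the old generator's fixed prefix length:
assert len({len(start) for start, _ in ranges}) > 1
```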
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.
Fixes #7758.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Row cache now allows the updater to first prepare the work, and then execute
the update atomically as the last step. Let's do that when rebuilding
the set: the new set is created in the preparation phase, and it
replaces the old one in the execution phase, satisfying the
atomicity requirement of row cache.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The procedure is changed to return the new set, so the caller will be responsible
for replacing the old set with the new one. This will allow our future
work where building the new set and enabling it will be decoupled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
External updater may do some preparatory work like constructing a new sstable list,
and at the end atomically replace the old list by the new one.
Decoupling the preparation from execution will give us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step will now be able to provide strong exception guarantees, as
it's now decoupled from the preparation step which can be non-exception-safe.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
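The prepare/execute split can be sketched as follows (hypothetical names, not the actual row_cache API):

```python
# Sketch: prepare() may be slow and may fail; execute() is a cheap,
# non-failing atomic swap, giving the strong exception guarantee.
class SstableSetUpdater:
    def __init__(self, table):
        self.table = table
        self.new_set = None

    def prepare(self, to_add, to_remove):
        # Potentially long-running: builds the new set without touching
        # the published one, so it may yield or throw safely.
        self.new_set = [s for s in self.table.sstables if s not in to_remove]
        self.new_set.extend(to_add)

    def execute(self):
        # Atomic replacement; nothing here can fail, so readers never
        # observe a partially rebuilt set.
        self.table.sstables = self.new_set

class Table:
    def __init__(self, sstables):
        self.sstables = sstables

t = Table(["a", "b", "c"])
u = SstableSetUpdater(t)
u.prepare(to_add=["d"], to_remove=["b"])
assert t.sstables == ["a", "b", "c"]      # unchanged until execute()
u.execute()
assert t.sstables == ["a", "c", "d"]
```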
The CQL tests in test/cql-pytest use the Python CQL driver's default
timeout for execute(), which is 10 seconds. This is usually more than
enough. However, in extreme cases like the one noted in issue #7838, 10
seconds may not be enough. In that issue, we run a very slow debug
build on a very slow test machine, and encounter a very slow request
(a DROP KEYSPACE that needs to drop multiple tables).
So this patch increases the default timeout to an even larger
120 seconds. We don't care that this timeout is ridiculously
large - under normal operations it will never be reached; there
is no code which loops for this amount of time, for example.
Tested that this patch fixes #7838 by choosing a much lower timeout
(1 second) and reproducing test failures caused by timeouts.
Fixes #7838.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>
MAX_LEVELS is the count of levels, but the sstable level (index) starts
from 0. So the maximum valid level is MAX_LEVELS - 1.
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes #7833
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer, we might hit a use-after-free
as in the following case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
(inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
(inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
(inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
(inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
(inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
(inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
(inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
(inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
(inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
(inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
(inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
(inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
(inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
(inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
(inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
(inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
(inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
(inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```
What happens here is that:
compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
: _out(std::move(out))
, _compression_metadata(cm)
, _offsets(_compression_metadata->offsets.get_writer())
, _compression(lc)
, _full_checksum(ChecksumType::init_checksum())
_compression_metadata points to a buffer held by the sstable object,
and _compression_metadata->offsets.get_writer() returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.
Fixes #7821
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
When applying one mutation partition to another, if a dummy entry
from the source falls into a continuous destination range, it
can just be dropped. However, the current implementation still
inserts it and then instantly removes it.
Relax this code-flow by dropping the unwanted entry without
tossing it.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224130438.11389-1-xemul@scylladb.com>
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Signed-off-by: Amos Kong <amos@scylladb.com>
When a rows_entry is added to row_cache it's constructed from a
clustering_row by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletable_row onboard,
from which rows_entry can copy-construct its own.
This lets us keep the rows_entry and deletable_row set of
constructors a bit shorter.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
"
This series does a lot of cleanups, dead code removal, and most
importantly fixes the following things in IDL compiler tool:
* The grammar now rejects invalid identifiers, which,
in some cases, allowed writing things like `std:vector`.
* Error reporting is improved significantly and failures
now point to the place of failure much more accurately.
This is done by restricting rule backtracking to those rules
which don't need it.
"
* 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla:
idl: move enum and class serializer code writers to the corresponding AST classes
idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums
idl: minor fixes and code simplification
idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn
idl: fix parsing of basic types and discard unneeded terminals
idl: remove unused functions
idl: improve error tracing in the grammar and tighten-up some grammar rules
idl: remove redundant `set_namespace` function
idl: remove unused `declare_class` function
idl: slightly change `str` and `repr` for AST types
idl: place directly executed init code into if __name__=="__main__"
For a connection-less environment, we need to add the node_exporter binary
to the scylla-server package instead of downloading it from the internet.
Related #7765 Fixes #2190 Closes #7796
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:
In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.
These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.
This uses the `const char*` overload, which delegates to the
`std::string_view` ctor overload. It is UB to pass a `nullptr`
pointer to the `std::string_view` ctor, hence leading to
segmentation faults in the aforementioned large data reporting
code.
What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.
A regression test is provided for the issue (written in
`cql-pytest` framework).
Tests: test/cql-pytest/test_large_cells_rows.py
Fixes: #6780
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
In Alternator's expression parser in alternator/expressions.g, a list can be
indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference
for the index, e.g., "something[:xyz]", should also work. So this patch adds
a test that checks whether "something[:xyz]" works, and confirms that both
DynamoDB and Alternator don't accept it and consider it a syntax error.
So Alternator's parser is correct to insist that the index be a literal
integer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>
* seastar 2bd8c8d088...1f5e3d3419 (5):
> Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E
> doc: add a coroutines section to the tutorial
> Merge "tests/perf: add random-seed config option" from Benny
> iotune: Print parameters affecting the measurement results
> cook: Add patch cmd for ragel build (signed char confusion on aarch64)
Expand the role of AST classes to also supply methods for actually
generating the code. More changes will follow eventually until
all generation code is handled by these classes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
We've encountered a number of reactor stalls
related to token_metadata that were fixed
in 052a8d036d.
This is a follow-up series that adds a clear_gently
method to token_metadata that uses continuations
to prevent reactor stalls when destroying token_metadata
objects.
Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug)
"
* tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla:
token_metadata: add clear_gently
token_metadata: shared_token_metadata: add mutate_token_metadata
token_metadata: futurize update_normal_tokens
abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
repair: replace_with_repair: convert to coroutine
A previous patch added test/cql-pytest/cassandra_tests - a framework for
porting Cassandra's unit tests to Python - but only ported two tiny test
files with just 3 tests. In this patch, we finally port a much larger
test file validation/entities/collection_test.java. This file includes
50 separate tests, which cover a lot of aspects of collection support,
as well as how other stuff interacts with collections.
As of now, 23 (!) of these 50 tests fail, and exposed six new issues
in Scylla which I carefully documented:
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #7740: CQL prepared statements incomplete support for "unset" values
Refs #7743: Restrictions missing support for "IN" on tables with
collections, added in Cassandra 4.0
Refs #7745: Length of map keys and set items are incorrectly limited to 64K
in unprepared CQL
Refs #7747: Handling of multiple list updates in a single request differs
from recent Cassandra
Refs #7751: Allow selecting map values and set elements, like in
Cassandra 4.0
These issues vary in severity - some are simply new Cassandra 4.0 features
that Scylla never implemented, but one (#7740) is an old Cassandra 2.2
feature which it seems we did not implement correctly in some cases that
involve collections.
Note that there are some things that the ported tests do not include.
In a handful of places there are things which the Python driver checks,
before sending a request - not giving us an opportunity to check how
the server handles such errors. Another notable change in this port is
that the original tests repeated a lot of tests with and without a
"nodetool flush". In this port I chose to stub the flush() function -
it does NOT flush. I think the point of these tests is to check the
correctness of the CQL features - *not* to verify that memtable flush
works correctly. Doing a real memtable flush is not only slow, it also
doesn't really check much (Scylla may still serve data from cache,
not sstables). So I decided it is pointless.
An important goal of this patch is that all 50 tests (except three
skipped because the Python driver performs client-side checking) pass
when run on Cassandra (with test/cql-pytest/run-cassandra). This is very
important: it was very easy to make mistakes while porting the tests,
and I did make many such mistakes; but running them against Cassandra
allowed me to fix those mistakes - because the correct tests should
pass on Cassandra. And now they do.
Unfortunately, the new tests are significantly slower than what we've
been accustomed in Alternator/CQL tests. The 50 tests create more than a
hundred tables, udfs, udts, and similar slow operations - they do not
reuse anything via fixtures. The total time for these 50 tests (in dev
build mode) is around 18 seconds. Just one test - testMapWithLargePartition -
is responsible for almost half (!) of that time - we should consider in
the future whether it's worth it or can be made smaller.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.
If the functor is successful, the mutated clone
is set back to the shared_token_metadata;
otherwise, the clone is destroyed.
With that, get rid of shared_token_metadata::clone
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function complexity is O(#tokens) in the worst case,
as for each endpoint token it traverses _token_to_endpoint_map
linearly to erase the endpoint mapping if it exists.
This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.
Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions. Until then the sync
version is still required from call sites that
are neither returning a future nor run in a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to
create symbolic links to scripts so that they appear in $PATH.
Please note that there are additional Python scripts (like perftune.py),
which are not in $PATH. That's because Python scripts are handled
separately in "install.sh" and no Python script has a "sbin" symlink. We
might want to change this in the future, though.
Fixes #6731
Closes #7809
This is a temporary scaffold for weaning ourselves off
linearization. It differs from with_linearized_managed_bytes in
that it does not rely on the environment (linearization_context)
and so is easier to remove.
We use managed_bytes::data() in a few places when we know the
data is non-fragmented (such as when the small buffer optimization
is in use). We'd like to remove managed_bytes::data() as linearization
is bad, so in preparation for that, replace internal uses of data()
with the equivalent direct access.
do_linearize() is an impure function as it changes state
in linearization_context. Extract the pure parts into a new
do_linearize_pure(). This will be used to linearize managed_bytes
without a linearization_context, during the transition period where
fragmented and non-fragmented values coexist.
do_get_value() is careful to return a fragmented view, but
its only caller get_value_from_mutation() linearizes it immediately
afterwards. Linearize it sooner; this prevents mixing in
fragmented values from cells (now via IMR) and fragmented values
from partition/clustering keys. It only works now because
keys are not fragmented outside LSA, and value_view has a special
case for single-fragment values.
This helps when keys become fragmented.
Converting from bytes to bytes is nonsensical, but it helps
when transitioning to other types (managed_bytes/managed_bytes_view),
and these types will have to_bytes() conversions.
We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`.
The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently).
The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader.
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 equal windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes#6418.
Closes#7437
* github.com:scylladb/scylla:
sstables: gather clustering key filtering statistics in TWCS single key reader
sstables: use time_series_sstable_set in time_window_compaction_strategy
sstable_set: new reader for TWCS single partition queries
mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set
sstable_set: introduce min_position_reader_queue
sstable_set: introduce time_series_sstable_set
sstables: add min_position and max_position accessors
sstable_set: make create_single_key_sstable_reader a virtual method
clustering_order_reader_merger: fix the 0 readers case
This commit introduces a new implementation of `create_single_key_sstable_reader`
in `time_series_sstable_set` dedicated for TWCS-created sstables.
It uses the fact that such sstables are mostly disjoint with respect to
contained `position_in_partition`s in order to decrease the number of
sstable readers that are opened at the same time.
The implementation uses `clustering_order_reader_merger` under the hood.
The reader assumes that the schema does not have static columns and none
of the queried sstables contain partition tombstones; also, it assumes
that the sstables have the min/max clustering key metadata in order for
the implementation to be efficient. Thus, if we detect that some of
these assumptions aren't true, we fall back to the old implementation.
This is a queue of readers of sstables in a time_series_sstable_set,
returning the readers in order of the smallest position_in_partition
that the sstables have. It uses the min/max clustering key sstable
metadata.
The readers are opened lazily, at the moment of being returned.
At this moment it is a slightly less efficient version of
bag_sstable_set, but in following commits we will use the new data
structures to gain advantage in single partition queries
for sstables created by TimeWindowCompactionStrategy.
... of sstable_set_impl.
Soon we shall provide a specialized implementation in one of the
`sstable_set_impl` derived classes.
The existing implementation is used as the default one.
There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0):
1. When the column is indexed, Scylla crashed (#7772)
2. Computing ranges and slices was throwing an exception
This series fixes them both; it also happens to resolve some old TODOs from restriction_test.
Tests: unit (dev, debug)
Closes #7804
* github.com:scylladb/scylla:
cql3: Fix value_for when restriction is impossible
cql3: Fix range computation for p=1 AND p=1
Previously, single_column_restrictions::value_for() assumed that a
column's restriction specifies exactly one value for the column. But
since 37ebe521e3, multiple equalities on the same column are allowed,
so the restriction could be a conjunction of conflicting
equalities (eg, c=1 AND c=0). That violates an assert and crashes
Scylla.
This patch fixes value_for() by gracefully handling the
impossible-restriction case.
Fixes#7772
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously compute_bounds was assuming that primary-key columns are
restricted by exactly one equality, resulting in the following error:
query 'select p from t where p=1 and p=1' failed:
std::bad_variant_access (std::get: wrong index for variant)
This patch removes that assumption and deals correctly with the
multiple-equalities case. As a byproduct, it also stops raising
"invalid null value" exceptions for null RHS values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Split `write`, `read` and `skip` serializer function writers to
separate functions in `handle_class` and `handle_enum` functions,
which slightly improves readability.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
* Introduce `ns_qualified_name` and `template_params_str` functions
to simplify code a little bit in `handle_enum` and `handle_class`
functions.
* Previously each serializer had a separate namespace open-close
statements, unify them into a single namespace scope.
* Fix a few more `hout` -> `cout` argument names.
* Rename `template` pattern to `template_decl` to improve clarity.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch, all functions called from `add_visitors`,
and this function itself, declared the argument denoting the output
file as `hout`. This was quite misleading, since `hout`
is meant to be the header file with declarations, while `cout` is the
implementation file.
These functions write to the implementation file, hence `hout` should
be changed to `cout` to avoid confusion.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch `btype` production was using `with_colon`
rule, which accidentally supported parsing both numbers and
identifiers (along with other invalid inputs, such as "123asd").
It was changed to use `ns_qualified_ident` and those places
which can accept numeric constants, are explicitly listing
it as an alternative, e.g. template parameter list.
Unfortunately, I had to make TemplateType explicitly construct
`BasicType` instances from numeric constants in the template argument
list. This is exactly the way it was handled before, though.
But nonetheless, this should be addressed sometime later.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Remove the following functions since they are not used:
* `open_namespaces`
* `close_namespaces`
* `flat_template`
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch replaces use of some handwritten rules to use their
alternatives already defined in `pyparsing.pyparsing_common`
class, i.e.: `number`, `identifier` productions.
Changed ignore patterns for comments to use pre-defined
`pp.cppStyleComment` instead of hand-written combination of
'//'-style and C-style comment rules.
Operator '-' is now used whenever possible to improve debugging
experience: it disables default backtracking for productions
so that compiler fails earlier and can now point more precisely
to a place in the input string where it failed instead of
backtracking to the top-level rule and reporting error there.
Template names and class names now use `ns_qualified_ident`
rule instead of `with_colon` which prevents grammar from
matching invalid identifiers, such as `std:vector`.
Many places are using the updated `identifier` production, which
is working correctly unlike its predecessor: now inputs
such as `1ident` are considered invalid.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Surround string representation with angle brackets. This improves
readability when printing debug output.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since the idl compiler is not intended to be used as a module by other
python build scripts, move the initialization code under an if checking
that the current module name is "__main__".
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When recycling a segment in O_DSYNC mode if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().
Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
"
This patch series consists of the following patches:
1. The first one turned out to be a massive rewrite of almost
everything in `idl-compiler.py`. It aims to decouple parser
structures from the internal representation which is used
in the code-generation itself.
Prior to the patch everything was working with raw token lists and
the code was extremely fragile and hard to understand and modify.
Moreover, every change in the parser code caused a cascade effect
of breaking things at many different places, since they were relying
on the exact format of output produced by parsing rules.
Now there is a bunch of supplementary AST structures which provide
hierarchical and strongly typed structure as the output of parsing
routine.
It is much easier to verify (by the means of `isinstance`, for example)
and extend since the internal structures used in code-generation are
decoupled from the structure of parsing rules, which are now controlled
by custom parse actions providing high-level abstractions.
It is tested manually by checking that the old code produces exactly
the same autogenerated sources for all Scylla IDLs as the new one.
2 and 3. Cosmetic changes only: fixed a few typos and moved from
old-fashioned `string.Template` to python f-strings.
This improves readability of the idl-compiler code by a lot.
Only one non-functional whitespace change introduced.
4. This patch adds a very basic support for the parser to
understand `const` specifier in case it's used with a template
parameter for a data member in a class, e.g.
struct my_struct {
std::vector<const raft::log_entry> entries;
};
It actually does two things:
* Adjusts `static_asserts` in corresponding serializer methods
to match const-ness of fields.
* Defines a second serializer specialization for const type in
`.dist.hh` right next to non-const one.
This seems to be sufficient for raft-related uses for now.
Please note there is no support for the following cases, though:
const std::vector<raft::log_entry> entries;
const raft::term_t term;
None of the existing IDLs are affected by the change, so that
we can gradually improve on the feature and write the idl
unit-tests to increase test coverage with time.
5. A basic unit-test that writes a test struct with an
`std::vector<S<const T>>` field and reads it back to verify
that serialization works correctly.
6. Basic documentation for AST classes.
TODO: should also update the docs in `docs/IDL.md`. But it is already
quite outdated, and some changes would even be out of scope for this
patch set.
"
* 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla:
idl: add docstrings for AST classes
idl: add unit-test for `const` specifiers feature
idl: allow to parse `const` specifiers for template arguments
idl: fix a few typos in idl-compiler
idl: switch from `string.Template` to python f-strings and format string in idl-compiler
idl: Decouple idl-compiler data structures from grammar structure
feed_writer() eats exceptions and transforms them into an end of stream
instead. Downstream validators hate when this happens.
Fixes#7482
Message-Id: <20201216090038.GB3244976@scylladb.com>
A tool which lists all partitions contained in an sstable index. As all
partitions in an sstable are indexed, this tool can be used to find out
what partitions are contained in a given sstable.
The printout has the following format:
$pos: $human_readable_value (pk{$raw_hex_value})
Where:
* $pos: the position of the partition in the (decompressed) data file
* $human_readable_value: the human readable partition key
* $raw_hex_value: the raw hexadecimal value of the binary representation
of the partition key
For now the tool requires the types making up the partition key to be
specified on the command line, using the `--type|-t` command line
argument, using the Cassandra type class name notation for types.
As these are not assumed to be widely known, this patch includes a
document mapping all cql3 types to their Cassandra type class name
equivalents (and more).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>
Fixes#7732
When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:
Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate
The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.
(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).
Added a check before flushing that determines whether we actually have
any data; if not, the RP relation assert is not upheld.
Closes#7799
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault
Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)
Fixes #7792
Closes #7798
* github.com:scylladb/scylla:
database: add flushes to waiting for pending operations
table: unify waiting for pending operations
database: add a phaser for flush operations
database: add waiting for pending streams on table drop
This patch introduces very limited support for declaring `const`
template parameters in data members.
It's not covering all the cases, e.g.
`const type member_variable` and `const template_def<T1, T2, ...>`
syntax is not supported at the moment.
Though the changes are enough for raft-related use: this makes it
possible to declare `std::vector<raft::log_entries_ptr>` (aka
`std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL.
Existing IDL files are not affected in any way.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Move to a modern and lightweight syntax of f-strings
introduced in python 3.6. It improves readability and provides
greater flexibility.
A few places are now using format strings instead, though:
in case a multiline substitution variable is used, the template
string should first be re-indented and only after that should the
formatting be applied, or we can end up with broken
indentation in the generated sources.
This change introduces one invisible whitespace change
in `query.dist.impl.hh`, otherwise all generated code is exactly
the same.
Tests: build(dev) and diff generated IDL sources by hand
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Instead of operating on the raw lists of tokens, transform them into
typed structures representation, which makes the code by many orders of
magnitude simpler to read, understand and extend.
This includes sweeping changes throughout the whole source code of the
tool, because almost every function was tightly coupled to the way
data was passed down from the parser right to the code generation
routines.
Tested manually by checking that old generated sources are precisely
the same as the new generated sources.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
working on table T, which is already dropped
5. Segfault/memory corruption
To prevent such races, a phaser for pending flushes is introduced
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
Download node_exporter in the frozen image to prepare for adding
node_exporter to the relocatable package.
Related #2190
Closes #7765
[avi: updated toolchain, x86_64/aarch64/s390x]
Alternator tracing tests require the cluster to have the 'always'
isolation level configured to work properly. If that's not the case,
the tests will fail due to not having CAS-related traces present
in the logs. In order to help the users fix their configuration,
a helper message is printed before the test case is performed.
Automatic tests do not need this, because they are all run with
a matching isolation level, but this message could greatly improve
the user experience for manual tests.
Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>
Alternator tracing tests used to leave the tracing probability
set to 0 after the test finished. That makes sense
for most test runs, but manual tests can work on existing clusters
with the tracing probability set to some other value. To preserve the
previous trace probability, the value is now extracted and stored,
so that it can be restored after the test is done.
Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>
Normally a file size should be aligned around block size, since
we never write to it any unaligned size. However, we're not
protected against partial writes.
Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.
Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
Three tests in test_streams.py run update_table() on a table without
waiting for it to complete, and then call update_table() on the same
table or delete it. This always works in Scylla, and usually works in
AWS, but if we reach the second call, it may fail because the previous
update_table() did not take effect yet. We sometimes see these failures
when running the Alternator test suite against AWS.
So in this patch, after each update_table() we wait for the table
to return from UPDATING to ACTIVE status.
The entire Alternator test suite now passes (or skipped) on AWS,
so: Fixes#7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>
The test test_query_filter.py::test_query_filter_paging fails on AWS
and shouldn't fail, so this patch fixes the test. Note that this is
only a test problem - no fix is needed for Alternator itself.
The test reads 20 results with 1-result pages, and assumed that
21 pages are returned. The 21st page may happen because when the
server returns the 20th, it might not yet know there will be no
additional results, so another page is needed - and will be empty.
Still a different implementation might notice that the last page
completed the iteration, and not return an extra empty page. This is
perfectly fine, and this is what AWS DynamoDB does today - and should
not be considered an error.
Refs #7778
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>
When request signature checking is enabled in Alternator, each request
should come with the appropriate Authorization header. Most errors in
preparing this header will result in an InvalidSignatureException
response; But DynamoDB returns a more specific error when this header is
completely missing: MissingAuthenticationTokenException. We should do the
same, but before this patch we return InvalidSignatureException also for
a missing header.
The test test_authorization.py::test_no_authorization_header used to
enshrine our wrong error message, and failed when run against AWS.
After this patch, we fix the error message and the test - which now
passes against both Alternator and AWS.
Refs #7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>
This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker.
The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues.
The series comes with a pytest test suite.
Examples:
```cql
SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Working with prepared statements works as usual - the timeout parameter can be
explicitly defined or provided as a marker:
```cql
SELECT * FROM t USING TIMEOUT ?;
```
```cql
INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Tests: unit(dev)
Fixes #7777
Closes #7781
* github.com:scylladb/scylla:
test: add prepared statement tests to USING TIMEOUT suite
docs: add an entry about USING TIMEOUT
test: add a test suite for USING TIMEOUT
storage_proxy: start propagating local timeouts as timeouts
cql3: allow USING clause for SELECT statement
cql3: add TIMEOUT attribute to the parser
cql3: add per-query timeout to select statement
cql3: add per-query timeout to batch statement
cql3: add per-query timeout to modification statement
cql3: add timeout to cql attributes
First of all, select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.
Secondly, if TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
1. It's unused since cbe510d1b8
2. It's unsafe to keep a reference to token_metadata&
potentially across yield points.
The higher-level motivation is to make
storage_service::get_token_metadata() private so we
can control better how it's used.
For cdc, if the token_metadata is going to be needed
in the future, it'd be better to get it from
db_context::_proxy.get_token_metadata_ptr().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>
"
This series fixes use-after-free via token_metadata&
We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges
To fix that, get_token_metadata_ptr and hold on to it
across yielding.
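The fix pattern (take shared ownership of the snapshot and keep holding it across yields) can be sketched with a toy asyncio analogy. Service, compute_ranges and the dict snapshot are illustrative stand-ins, not Scylla's API; in the C++ code the equivalent of the strong reference is the token_metadata_ptr:

```python
import asyncio

class Service:
    """Toy stand-in for storage_service: owns a replaceable snapshot."""
    def __init__(self):
        self._token_metadata = {"tokens": [1, 2, 3]}

    def get_token_metadata_ptr(self):
        # Like token_metadata_ptr: the caller shares ownership of the snapshot.
        return self._token_metadata

    def replace_token_metadata(self, new_tm):
        # Topology change: the service drops the old snapshot.
        self._token_metadata = new_tm

async def compute_ranges(service):
    tm = service.get_token_metadata_ptr()  # strong reference, held across yields
    await asyncio.sleep(0)                 # yield point; topology may change here
    return list(tm["tokens"])              # our reference keeps the snapshot valid

async def main():
    service = Service()
    task = asyncio.ensure_future(compute_ranges(service))
    await asyncio.sleep(0)                 # let the task run up to its yield point
    service.replace_token_metadata({"tokens": []})  # concurrent "topology change"
    return await task

result = asyncio.run(main())
```

Holding only a plain reference (the C++ bug) would leave `tm` dangling once the snapshot is replaced; holding shared ownership makes the replacement harmless.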
Fixes #7790
Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"
* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
Provide the token_metadata& to get_new_source_ranges by the caller,
who keeps it valid throughout the call.
Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When yielding in clone_only_token_map or clone_after_all_left
the token_metadata got with get_token_metadata() may go away.
Use get_token_metadata_ptr() instead to hold on to it.
And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.
Fixes #7790
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The validate_column_family() helper uses the global proxy
reference to get the database from. Fortunately, all its
callers can provide one via an argument.
tests: unit(dev)
"
* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
validation: Remove get_local_storage_proxy call
client_state: Call validate_column_family() with database arg
client_state: Add database& arg to has_column_family_access
storage_proxy: Add .local_db() getters
validate: Mark database argument const
"
The initial intent was to remove the call to the global storage service from
the secondary index manager's create_view_for_index(), but while fixing it,
one of the intermediate schema-table helpers managed to benefit from it
by re-using the database reference flying by.
The cleanup is done by simply pushing the database reference along the
stack from the code that already has it down to create_view_for_index().
tests: unit(dev)
"
* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
schema-tables: Add database argument to make_update_table_mutations
schema-tables: Factor out calls getting database instance
index-manager: Move feature evaluation one level up
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.
In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.
Fixes #7788
DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
* seastar 8b400c7b45...2de43eb6bf (3):
> core: show span free sizes correctly in diagnostics
> Merge "IO queues to share capacities" from Pavel E
> file: make_file_impl: determine blockdev using st_mode
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.
Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.
Closes #7767
* github.com:scylladb/scylla:
build: reinstate -Wunused-value warning for [[nodiscard]]
test: lib: don't ignore future in compare_readers()
test: mutation_test: check both ranges when comparing summaries
serialializer: silence unused value warning in variant deserializer
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).
Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.
Fixes #7696
Closes #7776
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hand.
The argument will be used by the next patch to remove the call to the
global storage proxy instance from make_update_indices_mutations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The make_update_indices_mutations gets database instance
for two things -- to find the cf to work with and to get
the value of a feature for index view creation.
To suit both, and to remove calls to the global storage proxy
and service instances, get the database once at the
function entrance. The next patch will clean this further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get one
it takes quite a long route through global storage service
instance.
Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in the caller and pass it as an argument.
This will also help to get rid of the call to the global
storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_next_partition uses the global proxy instance to get
the local database reference. Now it's available in the
reader object itself, so it's possible to remove this
call to the global storage proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reader uses the local database instance in its get_next_partition
method to find the keyspaces to work with.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is used in validate_column_family. The last caller of it was removed by
the previous patch, so we can remove the helper itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous patch brought in the database reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from the global proxy, it's better to shortcut.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A sequel to #7692.
This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.
Refs: #6138
Closes #7770
* github.com:scylladb/scylla:
types: add single-fragment optimization in validate()
utils: fragment_range: add with_simplified()
cql3: statements: select_statement: remove unnecessary use of with_linearized
cql3: maps: remove unnecessary use of with_linearized
cql3: lists: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: sets: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: attributes: remove unnecessary uses of with_linearized
types: validate lists without linearizing
types: validate tuples without linearizing
types: validate sets without linearizing
types: validate maps without linearizing
types: template abstract_type::validate on FragmentedView
types: validate_visitor: transition from FragmentRange to FragmentedView
utils: fragmented_temporary_buffer: add empty() to FragmentedView
utils: fragmented_temporary_buffer: don't add to null pointer
Manipulating fragmented views is costlier than manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.
Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
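A minimal Python sketch of the idea (illustrative only, not the real C++ API): dispatch the callback on a plain contiguous buffer when the view happens to have at most one fragment, and fall back to the general fragmented path otherwise.

```python
def with_simplified(fragments, func):
    """Invoke func on the cheapest representation of a fragmented buffer.

    `fragments` stands in for a FragmentedView: a sequence of bytes chunks.
    With at most one fragment, hand func a plain contiguous buffer so it
    can take the simpler, faster code path.
    """
    if len(fragments) == 0:
        return func(b"")            # empty view: trivially contiguous
    if len(fragments) == 1:
        return func(fragments[0])   # common case: contiguous underneath
    return func(fragments)          # general case: stay fragmented
```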
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, in remove_current(), null pointer offset
happens, and ubsan complains. Fix that.
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.
Fixes #7774.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
Currently removenode works like below:
- The coordinator node advertises the node to be removed in
REMOVING_TOKEN status in gossip
- Existing nodes learn the node in REMOVING_TOKEN status
- Existing nodes sync data for the range it owns
- Existing nodes send notification to the coordinator
- The coordinator node waits for notification and announce the node in
REMOVED_TOKEN
Current problems:
- Existing nodes do not tell the coordinator if the data sync is ok or failed.
- The coordinator can not abort the removenode operation in case of error
- A failed removenode operation will leave the node being removed in
REMOVING_TOKEN status forever.
- The removenode runs in best effort mode which may cause data
consistency issues.
It means that if a node that owns the range after the removenode
operation is down during the operation, the removenode operation
will continue to succeed without requiring that node to perform data
syncing. This can cause data consistency issues.
For example: five nodes in the cluster, RF = 3; for a range, n1, n2,
n3 are the old replicas, n2 is being removed, and after the removenode
operation the new replicas are n1, n5, n3. If n3 is down during the
removenode operation, only n1 will be used to sync data with the new
owner n5. This will break QUORUM read consistency if n1 happens to
miss some writes.
Improvements in this patch:
- This patch makes the removenode safe by default.
We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.
If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>
Example restful api:
$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"
- The coordinator can abort data sync on existing nodes
For example, if one of the nodes fails to sync data. It makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.
- The coordinator can decide which nodes to ignore and pass the decision
to other nodes
Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users would have to modify
config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.
Fixes #7359
Closes #7626
Verify that the input types are iterators and their value types are compatible
with the compare function.
Because some of the inputs were not actually valid iterators, they are adjusted
too.
Closes #7631
* github.com:scylladb/scylla:
types: add constraint on lexicographical_tri_compare()
composite: make composite::iterator a real input_iterator
compound: make compound_type::iterator a real input_iterator
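The three-way comparison the constraint protects can be sketched in a few lines of Python (illustrative, not Scylla's C++ template): walk both sequences with a three-way element comparator, and let the shorter sequence compare as less when it is a proper prefix of the other.

```python
from itertools import zip_longest

_END = object()  # sentinel marking an exhausted input

def lexicographical_tri_compare(a, b, cmp):
    """Three-way lexicographic compare of two iterables, driven by a
    three-way element comparator returning <0, 0, or >0."""
    for x, y in zip_longest(a, b, fillvalue=_END):
        if x is _END:
            return -1          # a is a proper prefix of b: a < b
        if y is _END:
            return 1           # b is a proper prefix of a: a > b
        c = cmp(x, y)
        if c != 0:
            return c
    return 0                   # same length, all elements equal
```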
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).
We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.
Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(
Fixes #7763.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
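In Python terms, the required semantics look roughly like this (update_item_add is a hypothetical helper, not Alternator's actual code): ADD on a missing attribute behaves as if the attribute were an empty set or a zero counter.

```python
def update_item_add(item, attr, value):
    """Apply UpdateItem's ADD action to a plain dict standing in for an item."""
    if isinstance(value, set):
        # Adding to a missing set attribute == adding to an empty set.
        item[attr] = item.get(attr, set()) | value
    else:
        # Adding to a missing counter attribute == adding to zero.
        item[attr] = item.get(attr, 0) + value
    return item
```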
The snitch name needs to be exchanged within the cluster once, in the shadow
round, so that joining nodes cannot use a wrong snitch. The snitch names
are compared on bootstrap and on normal node start.
If the cluster already used mixed snitches, the upgrade to this
version will fail. In this case the customer needs to add a node with
the correct snitch for every node with the wrong snitch, then take
down the nodes with the wrong snitch and only then do the upgrade.
Fixes #6832
Closes #7739
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; the server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.
Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, some
installations might want to change this default.
So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).
In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
overcommitted test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.
Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
"
The multishard_mutation_query test is far too slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.
tests: unit(dev)
"
* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
test: Make multishard_mutation_query test do less scans
configure: Add -DDEVEL to dev build flags
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The scylla_setup command suggestion does not show the argument of --io-setup,
because we mistakenly store a bool value in it (recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.
Also, --nic is currently ignored by the command suggestion; it needs to be printed just like the other options.
Related #7395
Closes #7724
* github.com:scylladb/scylla:
scylla_setup: print --swap-directory and --swap-size on command suggestion
scylla_setup: print --nic on command suggestion
scylla_setup: fix wrong command suggestion on --io-setup
A sequel to #7692.
This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type.
The only user of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template.
Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`.
Refs: #6138
Closes #7771
* github.com:scylladb/scylla:
types: switch serialize_for_cql from bytes to bytes_ostream
types: switch serialize_for_cql_aux from bytes to bytes_ostream
types: serialize user types to bytes_ostream
types: serialize lists to bytes_ostream
types: serialize sets to bytes_ostream
types: serialize maps to bytes_ostream
utils: fragment_range: use range-based for loop instead of boost::for_each
types: add write_collection_value() overload for bytes_ostream and value_view
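For context, the collection layout that the serialization code writes (a 4-byte big-endian element count followed by length-prefixed values, as in CQL native protocol v3 and later) can be sketched in Python; pack_collection is an illustrative name, not a Scylla function:

```python
import struct

def pack_collection(serialized_values):
    """Sketch of the CQL native-protocol (v3+) collection format:
    a 4-byte big-endian element count, then each element as a
    4-byte big-endian length followed by its bytes."""
    out = bytearray()
    out += struct.pack(">i", len(serialized_values))
    for value in serialized_values:
        out += struct.pack(">i", len(value))
        out += value
    return bytes(out)
```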
Increase the accepted disk-to-RAM ratio to 105 to accommodate even 7.5GB of
RAM for one NVMe. Log the various reasons for not recommending the instance
type.
Fixes #7587
Closes #7600
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of space_saving_top_k class, counting the unique
operations in the sampled set for up to the given capacity.
Fixes #4089
Closes #7766
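A compact sketch of the underlying Space-Saving algorithm (Python, illustrative; class and method names are not Scylla's): the sample tracks at most `capacity` distinct keys, and size(), the number of keys currently tracked, is the cardinality the enhanced API now returns.

```python
class SpaceSavingTopK:
    """Space-Saving heavy hitters with a fixed number of counters."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}

    def append(self, key):
        if key in self.counters:
            self.counters[key] += 1
        elif len(self.counters) < self.capacity:
            self.counters[key] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count + 1.
            victim = min(self.counters, key=self.counters.get)
            count = self.counters.pop(victim)
            self.counters[key] = count + 1

    def size(self):
        # Number of unique keys currently tracked (at most `capacity`).
        return len(self.counters)
```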
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).
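The distinction can be sketched as follows (encode_paging_key is a hypothetical helper, not Alternator's code): in DynamoDB's JSON wire format, strings and numbers pass through as text, while binary values must be base64-encoded.

```python
import base64
import json

def encode_paging_key(key):
    """Encode a LastEvaluatedKey-style dict into DynamoDB-flavoured JSON."""
    encoded = {}
    for name, value in key.items():
        if isinstance(value, bytes):
            # Bytes keys must be base64-encoded, the case the bug missed.
            encoded[name] = {"B": base64.b64encode(value).decode("ascii")}
        elif isinstance(value, (int, float)):
            encoded[name] = {"N": str(value)}
        else:
            encoded[name] = {"S": value}
    return json.dumps(encoded, sort_keys=True)
```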
This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.
Fixes #7768
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
Up until now, the versions of Scylla's Debian package dependencies were
unspecified. This was due to a technical difficulty in determining
the version of the depended-upon packages (such as scylla-python3
or scylla-jmx). Now that those packages are also built as part of
this repo, with a version identical to the server package
itself, we can make all of our packages depend on explicit versions.
The motivation for this change is that if a user tries to install
a specific Scylla version by installing a specific meta package,
it will silently drag in the latest components instead of the ones
of the requested versions.
The expected change in behavior is that after this change an attempt
to install a metapackage with version which is not the latest will fail
with an explicit error hinting the user what other packages of the same
version should be explicitly included in the command line.
Fixes #5514
Closes #7727
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevent [[nodiscard]] violations from warning.
Fix by reinstating the warning, now that all instances of the warning
have been fixed.
A copy/paste error means we ignore the termination of one of the
ranges. Change the comma expression to a disjunction to avoid
the unused value warning from clang.
The code is not perfect, since if the two ranges are not the same
size we'll invoke undefined behavior, but it is no worse than before
(where we ignored the comparison completely).
The variant deserializer uses a fold expression to implement
an if-tree with a short-circuit, producing an intermediate boolean
value to terminate evaluation. This intermediate value is unneeded,
but evokes a warning from clang when -Wunused-value is enabled.
Since we want to enable the warning, add a cast to void to ignore
the intermediate value.
We want to pass bytes_ostream to this loop in later commits.
bytes_ostream does not conform to some boost concepts required by
boost::for_each, so let's just use C++'s native loop.
When getting local ranges, an assumption is made that
if a range does not contain an end, or its end is the maximum token,
then it must contain a start. This assumption proved untrue
during manual tests, so it's now fortified with an additional check.
Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:
(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
_data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
_start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}
Closes #7764
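The fortified invariant can be sketched like this (Python, with illustrative names): a token range may be unbounded on either side, so a missing end does not imply a present start, which is exactly the shape the gdb output above shows.

```python
def describe_range(start, end):
    """Classify a token range whose bounds may each be absent (None)."""
    if start is None and end is None:
        return "full"          # covers the whole ring
    if start is None:
        return "left-open"     # the case the old code assumed impossible
    if end is None:
        return "right-open"
    return "bounded"
```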
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.
This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.
As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.
Fixes #7706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
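The change amounts to bounding each Scan page, roughly as below (a boto3-style call shape; the fake client makes the sketch self-contained and is not part of the real test):

```python
def scan_one_page(client, table_name, limit=50):
    """Scan one bounded page; return (items, last_evaluated_key)."""
    response = client.scan(TableName=table_name, Limit=limit)
    return response.get("Items", []), response.get("LastEvaluatedKey")

class _FakeDynamo:
    """Stand-in client so the sketch runs without a server."""
    def scan(self, TableName, Limit):
        items = [{"n": i} for i in range(Limit)]
        return {"Items": items, "LastEvaluatedKey": {"n": Limit - 1}}

items, last_key = scan_one_page(_FakeDynamo(), "compaction_history")
```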
The original goal of this patch was to replace the two single-node dtests
allow_filtering_test and allow_filtering_secondary_indexes_test, which
recently caused us problems when we wanted to change the ALLOW FILTERING
behavior but the tests were outside the tree. I'm hoping that after this
patch, those two tests could be removed from dtest.
But this patch actually tests more cases than those original dtests, and
moreover tests not just whether ALLOW FILTERING is required or not, but
also that the results of the filtering are correct.
Currently, four of the included tests are expected to fail ("xfail") on
Scylla, reproducing two issues:
1. Refs #5545:
"WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING
2. Refs #7608:
"WHERE c=1" on clustering key c should require ALLOW FILTERING, but
doesn't.
All tests, except the one for issue #5545, pass on Cassandra. That one
fails on Cassandra because it doesn't support IN on an indexed column at all
(regardless of whether ALLOW FILTERING is used or not).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>
There is a typo in the snapshot's schema.cql: a missing comma after the
compaction strategy class. Restoring the schema from that file will fail:
AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}
The map_as_cql_param() function has a `first` parameter to smartly add
the comma; compaction_strategy_options is never the first.
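The comma logic can be sketched like this (Python, illustrative; only the `first` parameter is taken from the commit message, the rest of the signature is hypothetical): every entry after the first must be preceded by ', ', so the caller rendering the options map must pass first=False.

```python
def map_as_cql_param(options, first=True):
    """Render dict entries as CQL map contents; prepend ', ' before
    every entry except while `first` is still True."""
    parts = []
    for key, value in options.items():
        parts.append(("" if first else ", ") + "'%s': '%s'" % (key, value))
        first = False
    return "".join(parts)

# The snapshot bug: the strategy class is rendered first, so the
# options map must be rendered with first=False to get its comma.
compaction = ("{"
              + map_as_cql_param({"class": "SizeTieredCompactionStrategy"})
              + map_as_cql_param({"max_compaction_threshold": "32"}, first=False)
              + "}")
```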
Fixes #7741
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes #7734
"
The storage service is called there to get the cached value
of db::system_keyspace::get_local_host_id(). Keeping the value
on database decouples it from storage service and kills one
more global storage service reference.
tests: unit(dev)
"
* 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla:
counters: Drop call to get_local_storage_service and related
counters: Use local id arg in transform_counter_update_to_shards
database: Have local id arg in transform_counter_updates_to_shards()
storage_service: Keep local host id to database
This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under ``docs`` directory on http://scylladb.github.io/scylla
1. Run the command ``make preview`` from the ``docs`` directory.
2. Check the terminal where you have executed the previous command. It should not raise warnings.
3. Open http://127.0.0.1:5500/ in a new browser tab to see the generated documentation pages.
The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure.
Closes #7752
* github.com:scylladb/scylla:
docs: fixed warnings
docs: added theme
The previous way of deleting records based on the whole
sstable data_size causes overzealous deletions (#7668)
and inefficiency in the rows cache due to the large number
of range tombstones created.
Therefore we'd be better off just letting the
records expire using the 30-day TTL.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
The local host id is now passed by argument, so we don't
need the counter_id::local() and some other methods that
call or are called by it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Only a few places in it need the uuid. And since it's only 16 bytes,
it's possible to safely capture it by value in the called lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.
The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The value in question is cached from db::system_keyspace
for places that want to have it without waiting for
futures. So far the only place is database counters code,
so keep the value on database itself. Next patches will
make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Citing #6138:
> In the past few years we have converted most of our codebase to work in
> terms of fragmented buffers, instead of linearised ones, to help avoid
> large allocations that put large pressure on the memory allocator.
>
> One prominent component that still works exclusively in terms of linearised
> buffers is the types hierarchy, more specifically the de/serialization code
> to/from CQL format. Note that for most types, this is the same as our
> internal format, notable exceptions are non-frozen collections and user
> types.
>
> Most types are expected to contain reasonably small values, but texts,
> blobs and especially collections can get very large. Since the entire
> hierarchy shares a common interface we can either transition all or none
> to work with fragmented buffers.
This series gets rid of intermediate linearizations in deserialization. The next
steps are removing linearizations from serialization, validation and comparison
code.
Series summary:
- Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered
while testing. Since it wasn't discovered earlier, I guess it doesn't occur in
any code path in master.)
- Add a `FragmentedView` concept to allow uniform handling of various types of
fragmented buffers (`bytes_view`, `temporary_fragmented_buffer::view`,
`ser::buffer_view` and likely `managed_bytes_view` in the future).
- Implement `FragmentedView` for relevant fragmented buffer types.
- Add helper functions for reading from `FragmentedView`.
- Switch `deserialize()` and all its helpers from `bytes_view` to
`FragmentedView`.
- Remove `with_linearized()` calls which just became unnecessary.
- Add an optimization for single-fragment cases.
The addition of `FragmentedView` might be controversial, because another concept
meant for the same purpose - `FragmentRange` - is already used. Unfortunately,
it lacks the functionality we need. The main (only?) thing we want to do with a
fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no
way to do that, because it's immutable by design. We can work around that by
wrapping it into a mutable view which will track the offset into the immutable
`FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's
wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing
around as a view - it stores a pair of fragment iterators, a fragment view and a
size (11 words) to conform to the iterator-based design of `FragmentRange`, when
one fragment iterator (4 words) already contains all needed state, just hidden.
I suggest we replace `FragmentRange` with `FragmentedView` (or something
similar) altogether.
Refs: #6138
Closes #7692
* github.com:scylladb/scylla:
types: collection: add an optimization for single-fragment buffers in deserialize
types: add an optimization for single-fragment buffers in deserialize
cql3: tuples: don't linearize in in_value::from_serialized
cql3: expr: expression: replace with_linearize with linearized
cql3: constants: remove unneeded uses of with_linearized
cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
cql3: lists: remove unneeded use of with_linearized
query-result-set: don't linearize in result_set_builder::deserialize
types: remove unneeded collection deserialization overloads
types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
cql3: sets: don't linearize in value::from_serialized
cql3: lists: don't linearize in value::from_serialized
cql3: maps: don't linearize in value::from_serialized
types: remove unused deserialize_aux
types: deserialize: don't linearize tuple elements
types: deserialize: don't linearize collection elements
types: switch deserialize from bytes_view to FragmentedView
types: deserialize tuple types from FragmentedView
types: deserialize set type from FragmentedView
types: deserialize map type from FragmentedView
types: deserialize list type from FragmentedView
types: add FragmentedView versions of read_collection_size and read_collection_value
types: deserialize varint type from FragmentedView
types: deserialize floating point types from FragmentedView
types: deserialize decimal type from FragmentedView
types: deserialize duration type from FragmentedView
types: deserialize IP address types from FragmentedView
types: deserialize uuid types from FragmentedView
types: deserialize timestamp type from FragmentedView
types: deserialize simple date type from FragmentedView
types: deserialize time type from FragmentedView
types: deserialize boolean type from FragmentedView
types: deserialize integer types from FragmentedView
types: deserialize string types from FragmentedView
types: remove unused read_simple_opt
types: implement read_simple* versions for FragmentedView
utils: fragmented_temporary_buffer: implement FragmentedView for view
utils: fragment_range: add single_fragmented_view
serializer: implement FragmentedView for buffer_view
utils: fragment_range: add linearized and with_linearized for FragmentedView
utils: fragment_range: add FragmentedView
utils: fragmented_temporary_buffer: fix view::remove_prefix
Values usually come in a single fragment, but we pay the cost of fragmented
deserialization nevertheless: bigger view objects (4 words instead of 2 words)
more state to keep updated (i.e. total view size in addition to current fragment
size) and more branches.
This patch adds a special case for single-fragment buffers to
abstract_type::deserialize. They are converted to a single_fragmented_view
before doing anything else. Templates instantiated with single_fragmented_view
should compile to better code than their multi-fragmented counterparts. If
abstract_type::deserialize is inlined, this patch should completely prevent any
performance penalties for switching from with_linearized to fragmented
deserialization.
with_linearized creates an additional internal `bytes` when the input is
fragmented. linearized copies the data directly to the output `bytes`, so it's
more efficient.
Devirtualizes collection_type_impl::deserialize (so it can be templated) and
adds a FragmentedView overload. This will allow us to deserialize collections
with explicit cql_serialization_format directly from fragmented buffers.
The final part of the transition of deserialize from bytes_view to
FragmentedView.
Adds a FragmentedView overload to abstract_type::deserialize and
switches deserialize_visitor from bytes_view to FragmentedView, allowing
deserialization of all types with no intermediate linearization.
The partition builder doesn't expect the looked-up row to exist. In fact
it already existing is a sign of a bug. Currently bugs resulting in
duplicate rows will manifest by tripping an assert in
`row::append_cell()`. This however results in poor diagnostics, so we
want to catch these errors sooner to be able to provide higher level
diagnostics. To this end, switch to the freshly introduced
`append_clustering_row()` so that duplicate rows are found early and in
a context where their identity is known.
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations: 0
single run duration: 1.000s
number of runs: 10
test iterations median mad min max
clustering_combined.ranges_generic 2911678 117.598ns 0.685ns 116.175ns 119.482ns
clustering_combined.ranges_specialized 3005618 111.015ns 0.349ns 110.063ns 111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes #7688
* github.com:scylladb/scylla:
tests: mutation_source_test for clustering_order_reader_merger
perf: microbenchmark for clustering_order_reader_merger
mutation_reader_test: test clustering_order_reader_merger in memory
test: generalize `random_subset` and move to header
mutation_reader: introduce clustering_order_reader_merger
In issue #7722, it was suggested that we should port Cassandra's CQL unit
tests into our own repository, by translating the Java tests into Python
using the new cql-pytest framework. Cassandra's CQL unit test framework is
orders of magnitude faster than dtest, and in-tree, so Cassandra have been
moving many CQL correctness tests there, and we can also benefit from their
test cases.
In this patch, we take the first step in a long journey:
1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the
translated Cassandra tests will reside. The structure of this directory
will mirror that of the test/unit/org/apache/cassandra/cql3 directory in
the Cassandra repository.
pytest conveniently looks for test files recursively, so when all the
cql-pytest are run, the cassandra_tests files will be run as well.
As usual, one can also run only a subset of all the tests, e.g.,
"test/cql-pytest/run -vs cassandra_tests" runs only the tests in the
cassandra_tests subdirectory (and its subdirectories).
2. I translated into Python two of the smallest test files -
validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just
three test functions.
The plan is to translate entire Java test files one by one, and to mirror
their original location in our own repository, so it will be easier
to remember what we already translated and what remains to be done.
3. I created a small library, porting.py, of functions which resemble the
common functions of the Java tests (CQLTester.java). These functions aim
to make porting the tests easier. Despite the resemblance, the ported code
is not 100% identical (of course) and some effort is still required in
this porting. As we continue this porting effort, we'll probably need
more of these functions, and we can also continue to improve them to reduce
the porting effort.
Refs #7722.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>
This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable. These are accounted for in the sstable writer whenever the respective large data entry is encountered.
It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged.
Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds.
Fixes #7668
Test: unit(dev)
Dtest: wide_rows_test.py (in progress)
Closes #7669
* github.com:scylladb/scylla:
docs: sstable-scylla-format: add large_data_stats subcomponent
large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
large_data_handler: maybe_delete_large_data_entries: move out of line
sstables: load large_data_stats from scylla_metadata
sstables: store large_data_stats in scylla_metadata
sstables: writer: keep track of large data stats
large_data_handler: expose methods to get threshold
sstables: kl/writer: never record too many rows
large_data_handler: indicate recording of large data entries
large_data_handler: move constructor out of line
In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla.
I never expected to reach this timeout - it normally takes (in dev
build mode) around 2 seconds, but in one run on Jenkins we did reach it.
It turns out that the code did not recognize this timeout correctly,
and thought that Scylla had booted correctly - and then failed all the
subtests when they failed to connect to Scylla.
This patch fixes the timeout logic. After the timeout, if Scylla's
CQL port is still not responsive, the test run is failed - without
trying to run many individual tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>
When a row was inserted into a table with no regular columns, and no
such row existed in the first place, postimage would not be produced.
Fix this.
Fixes #7716.
Closes #7723
If the sstable has scylla_metadata::large_data_stats use them
to determine whether to delete the corresponding large data records.
Otherwise, defer to the current method of comparing the sstable
data_size to the respective thresholds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the actual deletion of the large data entries
is done in the background and does not capture the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink, and it will be released
as soon as maybe_delete_large_data_entries returns.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Load the large data stats from the scylla_metadata component
if they are present. Otherwise, if we're opening a legacy sstable
that has no scylla_metadata_type::LargeDataStats entry, leave
sstable::_large_data_stats disengaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Store the large data statistics in the scylla_metadata component.
These will be retrieved when loading the sstable and be
used for determining whether to delete the corresponding
large data entries upon sstable deletion.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, statement_restrictions::find_idx() would happily return an
index for a non-EQ restriction (because it checked only the column
name, not the operator). This is incorrect: when the selected index
is for a non-EQ restriction, it is impossible to query that index
table.
Fixes #7659.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7665
* seastar 010fb0df1e...8b400c7b45 (6):
> append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size
> Merge "Extend per task-queue timing statistics" from Pavel E
> tls_test: Create test certs at build time
> cook: upgrade hwloc version
> memory: rate-limit diagnostics messages
> util/log: add rate-limited version of writer version of log()
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clustering key. This may
result in undefined behavior downstream.
It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:
seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180
To prevent corrupting the state in this way, we should fail
early. This patch adds validation which will fail thrift requests
which attempt to create invalid clustering keys.
Fixes #7568.
Example error:
Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600
Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
This patch adds an option to scylla_setup to configure an rsyslog destination.
The monitoring stack has an option to get information from rsyslog it
requires that rsyslog on the scylla machines will send the trace line to
it.
The configuration will be in a Scylla configuration file, so it is safe to run it multiple times.
Fixes #7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #7634
* github.com:scylladb/scylla:
dist/common/scripts/scylla_setup: Optionally config rsyslog destination
Adding dist/common/scripts/scylla_rsyslog_setup utility
This patch adds an option to scylla_setup to configure an rsyslog
destination.
The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the scylla machines send the trace lines to
it.
If the /etc/rsyslog.d/ directory exists (which means the current system
runs rsyslog), scylla_setup will ask whether to add rsyslog configuration
and, if yes, run scylla_rsyslog_setup.
Fixes #7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The scylla_setup command suggestion does not show an argument for --io-setup,
because we mistakenly store a bool value in it (recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.
Related #7395
* scylla-dev/snapshot_fixes_v1:
raft: ignore append_reply from a peer in SNAPSHOT state
raft: Ignore outdated snapshots
raft: set next_idx to correct value after snapshot transfer
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
Fix#7680 by never using secondary index for multi-column restrictions.
Modify expr::is_supported_by() to handle multi-column correctly.
Tests: unit (dev)
Closes #7699
* github.com:scylladb/scylla:
cql3/expr: Clarify multi-column doesn't use indexing
cql3: Don't use index for multi-column restrictions
test: Add eventually_require_rows
The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively.
The tests do not discover any unknown bug in Scylla, however, they do show that Scylla is more strict in its definition of "valid UTF-8" compared to Cassandra.
Closes #7719
* github.com:scylladb/scylla:
test/cql-pytest: add tests for validation of inserted strings
test/cql-pytest: add "scylla_only" fixture
test/cpy-pytest: enable experimental features
This change adds tracking of all the CQL errors that can be
raised in response to a CQL message from a client, as described
in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs
included.
Fixes #5859
Closes #7604
We have "Conflicts: kernel < 3.10.0-514" on rpm package to make sure
the environment is running newer kernel.
However, a user may use a non-standard kernel which has a different package
name, like kernel-ml or kernel-uek.
On such environments the Conflicts tag does not work correctly.
Even when the system is running a newer kernel, rpm only checks the "kernel"
package version number.
To avoid such issue, we need to drop Conflicts tag.
Fixes #7675
This patch adds comprehensive cql-pytest tests for checking the validation
of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL
using several methods - a string can be a string literal as
part of the statement, can be encoded as a blob (0x...),
can be a binding parameter for a prepared statement, or can be returned
by user-defined functions - and these tests check all of them.
We already have low-level unit tests for UTF-8 parsing in
test/boost/utf8_test.cc, but the new tests here confirm that we really
call these low-level functions in the correct way. Moreover, since these
are CQL tests, they can also be run against Cassandra, and doing that
demonstrated that Scylla's UTF-8 parsing is *stricter* than Cassandra's -
Scylla's UTF-8 parser rejects the following sequences which Cassandra's
accepts:
1. \xC0\x80 as another non-minimal representation of null. Note that other
non-minimal encodings are rejected by Cassandra, as expected.
2. Characters beyond the official Unicode range (or what Scylla considers
the end of the range).
3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra
accepts them, and Scylla does not.
In the future, we should consider whether Scylla is more correct than
Cassandra here (so we're fine), or whether compatibility is more important
than correctness (so this exposed a bug).
The ASCII tests reproduce issue #5421 - that trying to insert a
non-ASCII string into an "ascii" column should produce an error on
insert - not later when fetching the string. This test now passes,
because issue 5421 was already fixed.
These tests did not expose any bug in Scylla (other than the differences
with Cassandra mentioned above), so all of them pass on Scylla. Two
of the tests fail on Cassandra, because Cassandra does not recognize
some invalid UTF-8 (according to Scylla's definition) as invalid.
Refs #5421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reject the previously accepted case where the multi-column restriction
applied to just a single column, as it causes a crash downstream. The
user can drop the parentheses to avoid the rejection.
Fixes #7710
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7712
"
This series adds maybe_yield called from
cleanup_compaction::get_ranges_for_invalidation
to avoid reactor stalls.
To achieve that, we first extract bool_class can_yield
to utils/maybe_yield.hh, and add a convenience helper:
utils::maybe_yield(can_yield) that conditionally calls
seastar::thread::maybe_yield if it can (when called in a
seastar thread).
With that, we add a can_yield parameter to dht::to_partition_ranges
and dht::partition_range::deoverlap (defaults to false), and
use it from cleanup_compaction::get_ranges_for_invalidation,
as the latter is always called from `consume_in_thread`.
Fixes #7674
Test: unit(dev)
"
* tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla:
compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
dht/i_partitioner: to_partition_ranges: support yielding
locator: extract can_yield to utils/maybe_yield.hh
It is used to force remove a node from gossip membership if something
goes wrong.
Note: run the force_remove_endpoint API at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes from coming back.
Because nodes that have not run the force_remove_endpoint API command can
gossip the removed node's information to other nodes within 2 *
ring_delay (2 * 30 seconds by default).
For instance, in a 3-node cluster where node 3 is decommissioned, to remove
node 3 from gossip membership prior to the automatic removal (3 days by
default), run the API command on both node 1 and node 2 at the same time.
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"
Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.
Fixes #2134
Closes #5436
This patch adds a fixture "scylla_only" which can be used to mark tests
for Scylla-specific features. These tests are skipped when running against
other CQL servers - like Apache Cassandra.
We recognize Scylla by looking at whether any system table exists with
the name "scylla" in its name - Scylla has several of those, and Cassandra
has none.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
bytes_view is one of the types we want to deserialize from (at least for now),
so we want to be able to pass it to deserialize() after it's transitioned to
FragmentView.
single_fragmented_view is a wrapper implementing FragmentedView for bytes_view.
It's constructed from bytes_view explicitly, because it's typically used in
contexts where we want to phase linearization (and, by extension, bytes_view) out.
This patch introduces FragmentedView - a concept intended as a general-purpose
interface for fragmented buffers.
Another concept made for this purpose, FragmentRange, already exists in the
codebase. However, it's unwieldy. The iterator-based design of FragmentRange is
harder to implement and requires more code, but more importantly it makes
FragmentRange immutable.
Usually we want to read the beginning of the buffer and pass the rest of it
elsewhere. This is impossible with FragmentRange.
FragmentedView can do everything FragmentRange can do and more, except for
playing nicely with iterator-based collection methods, but those are useless for
fragmented buffers anyway.
Disk parsing expects the output of a recursive listing of the GCP
metadata REST call. The method used to list recursively by default,
but now it requires a boolean flag to run in recursive mode.
Fixes #7684
Closes #7685
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.
This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.
Increase to 1200, which is enough for 6 instances * 200 shards.
Fixes #7700.
Closes #7701
When we introduced dependencies.conf, we mistakenly added it on rpm as %ghost,
but it should be a normal file, installed normally on package installation.
Fixes #7703
Closes #7704
Fixes #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Closes #7697
* github.com:scylladb/scylla:
redis::service: Shut down sharded<> subobject on startup exception
transport::controller: Shut down distributed object on startup exception
Refs #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Fixes #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
The downstream code expects a single-column restriction when using an
index. We could fix it, but we'd still have to filter the rows
fetched from the index table, unlike the code that queries the base
table directly. For instance, WHERE (c1,c2,c3) = (1,2,3) with an
index on c3 can fetch just the right rows from the base table but all
the c3=3 rows from the index table.
Fixes #7680
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
After snapshot is transferred progress::next_idx is set to its index,
but the code uses current snapshot to set it instead of the snapshot
that was transferred. Those can be different snapshots.
After a node becomes leader it needs to do two things: send an append
message to establish its leadership and commit one entry to make sure
all previous entries with smaller terms are committed as well.
Snapshot index cannot be used to check snapshot correctness since some
entries may not be commands and thus do not affect the snapshot value.
Let's use the applied entries count instead.
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In commit 9b28162f88 (repair: Use label
for node ops metrics), we switched to use label for different node
operations. We should use the same description for the same metric name.
Fixes #7681
Closes #7682
1. sstables: move `sstable_set` implementations to a separate module
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
3. sstables: move sstable reader creation functions to `sstable_set`
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
4. sstables: pass `ring_position` to `create_single_key_sstable_reader`
instead of `partition_range`.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
5. sstable_set: refactor `filter_sstable_for_reader_by_pk`
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
Split from #7437.
Closes #7655
* github.com:scylladb/scylla:
sstable_set: refactor filter_sstable_for_reader_by_pk
sstables: pass ring_position to create_single_key_sstable_reader
sstables: move sstable reader creation functions to `sstable_set`
mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
sstables: move sstable_set implementations to a separate module
This piece of logic was wrong for two unrelated reasons:
1. When fragmented_temporary_buffer::view is constructed from bytes_view,
_current is null. When remove_prefix was used on such view, null pointer
dereference happened.
2. It only worked for the first remove_prefix call. A second call would put a
wrong value in _current_position.
For sstable versions greater than or equal to md, the `min_max_column_names`
sstable metadata gives a range of position-in-partitions such that all
clustering rows stored in this sstable have positions in this range.
Partition tombstones in this context are understood as covering the
entire range of clustering keys; thus, if the sstable contains at least
one partition tombstone, the sstable position range is set to be the
range of all clustered rows.
Therefore, by checking that the position range is *not* the range of all
clustered rows we know that the sstable cannot have any partition tombstones.
Closes #7678
It is not legal to fast forward a reader before it enters a partition.
One must ensure that there even is a partition in the first place. For
this one must fetch a `partition_start` fragment.
Closes #7679
Fixes, features needed for testing, snapshot testing.
Free election after partitioning (replication test) .
* https://github.com/alecco/scylla/tree/raft-ale-tests-05e:
raft: replication test: partitioning with leader
raft: replication test: run free election after partitioning
raft: expose fsm tick() to server for testing
raft: expose is_leader() for testing
raft: replication test: test take and load snapshot
raft: fix a bug in leader election
raft: fix default randomized timeout
raft: replication test: fix custom next leader
raft: replication test: custom next leader noop for same
raft: replication test: fix failure detector for disconnected
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
instead of partition_range.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
Currently, each internal page fetched during aggregating
gets a timeout based on the time the page fetch was started,
rather than the query start time. This means the query can
continue processing long after the client has abandoned it
due to its own timeout, which is based on the query start time.
Fix by establishing the timeout once when the query starts, and
not advancing it.
Test: manual (SELECT count(*) FROM a large table).
Fixes#1175.
Closes #7662
The C and C++ sub-builds were placed in submodule_pool to
reduce concurrency, as they are memory intensive (well, at least
the C++ jobs are), and we choose build concurrency based on memory.
But the other submodules are not memory intensive, and certainly
the packaging jobs are not (and they are single-threaded too).
To allow these simple jobs to utilize multicores more efficiently,
remove them from submodule_pool so they can run in parallel.
Closes #7671
The unified package is quite large (1GB compressed), and it
is the last step in the build so its build time cannot be
parallelized with other tasks. Compress it with pigz to take
advantage of multiple cores and speed up the build a little.
Closes #7670
We initially implemented run() and out() functions because we couldn't use
subprocess.run() since we were on Python 3.4.
But since we moved to relocatable python3, we don't need to implement it ourselves.
We kept using these functions because we needed to set the PATH environment variable.
Since we recently moved that code to the python thunk, we are finally able to
drop run() and out() and switch to subprocess.run().
When partitioning without keeping the existing leader, run an election
without forcing a particular leader.
To force a leader after partitioning, a test can just set it with new_leader{X}.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For tests to advance servers they need to invoke tick().
This is needed to advance free elections.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Through configuration trigger automatic snapshotting.
For now, handle expected log index within the test's state machine and
pass it with snapshot_value (within the test file).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a server responds favourably to RequestVote RPC, it should
reset its election timer, otherwise it has very high chances of becoming
a candidate with an even newer term, despite successful elections.
A candidate with a term larger than the leader rejects AppendEntries
RPCs and can not become a leader itself (because of protection
against disruptive leaders), so it is stuck in this state.
The range after the election timeout should start at +1.
This matches existing update_current_term() code adding dist(1, 2*n).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
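The timeout-range fix can be illustrated with a toy version of the randomized timeout; `ELECTION_TIMEOUT` and the function name here are made up for the sketch, matching only the dist(1, 2*n) shape described above:

```cpp
#include <random>

// Sketch: the random extra delay is drawn from [1, 2n] rather than [0, 2n],
// so a server never picks a zero offset on top of the base timeout.
// ELECTION_TIMEOUT stands in for raft's base timeout in ticks.
constexpr int ELECTION_TIMEOUT = 10;

int randomized_election_timeout(std::mt19937& rng) {
    std::uniform_int_distribution<int> dist(1, 2 * ELECTION_TIMEOUT);
    return ELECTION_TIMEOUT + dist(rng);
}
```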
Adjustments after changes due to free election in partitioning and changes in
the code.
Elapse previous leader after isolating it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
scylla_rsyslog_setup adds a configuration file to rsyslog to forward the
traces to a remote server.
It will override any existing file, so it is safe to run it multiple
times.
It takes an IP, or an IP and port, from the user for that configuration; if
no port is provided, the default port of Scylla-Monitoring promtail is
used.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Follow-up to https://github.com/scylladb/scylla/pull/6916.
- Fixes wrong usage of `resource_manager::prepare_per_device_limits`,
- Improves locking in `resource_manager` so that it is safer to call its methods concurrently,
- Adds comments around `resource_manager::register_manager` so that it's more clear what this method does and why.
Closes #7660
* github.com:scylladb/scylla:
hints/resource_manager: add comments to register_manager
hints/resource_manager: fix indentation
hints/resource_manager: improve mutual exclusion
hints/resource_manager: correct prepare_per_device_limits usage
The "ninja dist-server-tar" command is a full replacement for
"build_reloc.sh" script. Our release engineering infrastructure has been
switched to ninja, so let's remove "build_reloc.sh" as obsolete.
Now that CDC is GA, it should be enabled in all the tests by default.
To achieve that, the PR adds a special db::config::add_cdc_extension()
helper which is used in cql_test_env to make sure CDC is usable in
all the tests that use cql_test_env. As a result, cdc_tests can be
simplified.
Finally, some trailing whitespaces are removed from cdc_tests.
Tests: unit(dev)
Closes #7657
* github.com:scylladb/scylla:
cdc: Remove trailing whitespaces from cdc_tests
cdc: Remove mk_cdc_test_config from tests
config: Add add_cdc_extension function for testing
cdc: Add missing includes to cdc_extension.hh
The patch which introduces build-dependent testing
has a regression: it quietly filters out all tests
which are not part of ninja output. Since ninja
doesn't build any CQL tests (including CQL-pytest),
all such tests were quietly disabled.
Fix the regression by only doing the filtering
in unit and boost test suites.
test: dev (unit), dev + --build-raft
Message-Id: <20201119224008.185250-1-kostja@scylladb.com>
Some systems (at least, Centos 7, aarch64) block the membarrier()
syscall via seccomp. This causes Scylla or unit tests to burn cpu
instead of sleeping when there is nothing to do.
Fix by instructing podman/docker not to block any syscalls. I
tested this with podman, and it appears [1] to be supported on
docker.
[1] https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profile
Closes #7661
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
"
The qctx is a global object that references the query processor and
the database to let the rest of the code query the system keyspace.
As the first step of de-globalizing it, remove the database
reference from it. After this change the qctx remains a simple
wrapper over the query processor (which is already de-globalized),
and the query processor in turn is mostly needed only to parse
the query string into a prepared statement. This, in turn,
makes it possible to remove the qctx later by parsing the
query strings on boot and carrying _them_ around, not the qctx
itself.
tests: unit(dev), dtest(simple_cluster_driver_test:dev), manual start/stop
"
* 'br-remove-database-from-qctx' of https://github.com/xemul/scylla:
query-context: Remove database from qctx
schema-tables: Use query processor reference in save_system(_keyspace)?_schema
system-keyspace: Rewrite force_blocking_flush
system-keyspace: Use cluster_name string in check_health
system-keyspace: Use db::config in setup_version
query-context: Kill global helpers
test: Use cql_test_env::execute_cql instead of qctx version
code: Use qctx::execute_cql methods, not global ones
system-keyspace: Do not call minimal_setup for the 2nd time
system-keyspace: Fix indentation after previous patch
system-keyspace: Do not do invoke_on_all by hands
system-keyspace: Remove dead code
The save_system_schema and save_system_keyspace_schema are both
called on start and can get the needed query processor reference
from arguments.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is called after query_processor::execute_internal
to flush the cf. Encapsulating this flush inside database and
getting the database from query_processor lets us remove the
database reference from the global qctx object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The check_health needs the global qctx to get db.config.cluster_name,
which is already available at the caller side.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the beginning of de-globalizing global qctx thing.
The setup_version() needs global qctx to get config from.
It's possible to get the config from the caller instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch, but for tests. Since cql_test_env
doesn't have qctx on board, the patch makes one step forward
and calls what is called by qctx::execute_cql.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are global db::execute_cql() helpers that just forward
the args into qctx::execute_cql(). The former are going away,
so patch all callers to use qctx themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The system_keyspace::minimal_setup is already called by hand from main.cc,
some steps before the regular ::setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_truncation_record needs to run cf.cache_truncation_record
on each shard's DB, so invoke_on_all can be used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit causes start, stop and register_manager methods of the
resource_manager to be serialized with respect to each other using the
_operation_lock.
Those function modify internal state, so it's best if they are
protected with a semaphore. Additionally, those function are not going
to be used frequently, therefore it's perfectly fine to protect them in
such a coarse manner.
Now, space_watchdog has a dedicated lock for serializing its on_timer
logic with resource_manager::register_manager. The reason for a separate
lock is that resource_manager::stop cannot use the same lock as the
space_watchdog - otherwise a situation could occur in which
space_watchdog waits for semaphore units held by
resource_manager::stop(), and resource_manager::stop() waits until the
space_watchdog stops its asynchronous event loop.
The resource_manager::prepare_per_device_limits function calculates disk
quota for registered hints managers, and creates an association map:
from a storage device id to those hints manager which store hints on
that device (_per_device_limits_map).
This function was used under the assumption that it is idempotent, which
is wrong. In resource_manager::register_manager, if the
resource_manager is already started, prepare_per_device_limits would be
called, and those hints managers which were previously added to the
_per_device_limits_map would be added again. This would cause the space
used by those managers to be calculated twice, which would artificially
lower the limit which we impose on the space hints are allowed to occupy
on disk.
This patch fixes this problem by changing the prepare_per_device_limits
function to operate on a hints manager passed by argument. Now, we make
sure that this function is called on each hints manager only once.
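The double-counting bug and its fix can be sketched as follows; the types and the per-manager registration guard are illustrative, not the real `resource_manager`:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical sketch: accounting a hints manager's disk usage per device
// must happen exactly once per manager, otherwise its space is counted
// twice and the effective quota shrinks.
struct device_limits {
    uint64_t used_bytes = 0;
};

class resource_manager {
public:
    // Fixed variant: operates on one manager and refuses re-registration.
    void prepare_per_device_limits(const std::string& manager, uint64_t used, int device_id) {
        if (!_registered.insert(manager).second) {
            return;  // already accounted for; don't double-count
        }
        _per_device_limits_map[device_id].used_bytes += used;
    }
    uint64_t used_on(int device_id) const {
        auto it = _per_device_limits_map.find(device_id);
        return it == _per_device_limits_map.end() ? 0 : it->second.used_bytes;
    }
private:
    std::unordered_set<std::string> _registered;
    std::unordered_map<int, device_limits> _per_device_limits_map;
};
```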
Now that CDC is GA and enabled by default, there's no longer a need
for a specific config in CDC tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragments batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition).
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
The PR also improves some comments.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes #7656
* github.com:scylladb/scylla:
mutation_reader: generalize `combined_mutation_reader`
mutation_reader: fix description of mutation_fragment_merger
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
The delay is added to the timestamp to make sure that all the nodes
in the cluster manage to learn about it before the timestamp passes.
It is safe to not add the delay for the first node because we know it's the only node
in the cluster and no one else has to learn about the timestamp.
Fixes #7645
Tests: unit(dev)
Closes #7654
* github.com:scylladb/scylla:
cdc: Don't add delay to the timestamp of the first generation
cdc: Change for_testing to add_delay in make_new_cdc_generation
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragments batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition).
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
"
We've recently seen failures in this unit test as follows:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
```
This series fixes 2 issues in this test:
1. The core issue where std::out_of_range exception
is not handled in calculate_natural_endpoints().
2. A secondary issue where the static `snitch_inst` isn't
stopped when the first exception is hit, failing
the next time the snitch is started, as it wasn't
stopped properly.
Test: network_topology_strategy_test(release)
"
* tag 'nts_test-harden-calculate_natural_endpoints-v1' of github.com:bhalevy/scylla:
test: network_topology_strategy_test: has_sufficient_replicas: handle empty dc endpoints case
test: network_topology_strategy_test: fixup indentation
test: network_topology_strategy_test: always stop_snitch after create_snitch
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
Fixes #7645
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The meaning of the parameter changes from defining whether the function
is called in a testing environment to deciding whether a delay should be
added to a timestamp of a newly created CDC generation.
This is a preparation for an improvement in the following patch, which does
not add the delay on every node, only on non-first nodes.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
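The parameter's new meaning can be sketched in one function; the names and the delay source below are illustrative only:

```cpp
#include <chrono>

using sys_time = std::chrono::system_clock::time_point;

// Sketch: the generation timestamp gets a propagation delay only when other
// nodes must learn about it; the first node in the cluster skips the delay.
sys_time new_generation_timestamp(sys_time now, bool add_delay,
                                  std::chrono::seconds ring_delay) {
    return add_delay ? now + ring_delay : now;
}
```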
Asias He reports that git on Windows filesystem is unhappy about the
colon character (":") present in dist-check files:
$ git reset --hard origin/master
error: invalid path 'tools/testing/dist-check/docker.io/centos:7.sh'
fatal: Could not reset index file to revision 'origin/master'.
Rename the script to use a dash instead.
Closes #7648
The current tests use a hash state machine that checks for a specific order
of entry application. The order is not always guaranteed, though.
Backpressure may delay the submission of some entries, and when they are
released together they may be reordered in debug mode due to
SEASTAR_SHUFFLE_TASK_QUEUE. Introduce the ability for a test to choose the
state machine type and implement a commutative state machine that does
not care about ordering.
To prevent the log from taking too much memory, introduce a mechanism that
limits the log to a certain size. If the size is reached, no new log
entries can be submitted until previous entries are committed and
snapshotted.
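The size-limit mechanism can be sketched as a bounded log that rejects submissions until a snapshot frees space; the types here are illustrative, not the test's real state machine:

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Sketch: submissions are rejected once the in-memory log reaches the
// limit, until earlier entries are committed and snapshotted (modeled
// here as truncating the prefix).
class bounded_log {
public:
    explicit bounded_log(size_t limit) : _limit(limit) {}
    bool try_submit(std::string entry) {
        if (_entries.size() >= _limit) {
            return false;  // back off until a snapshot frees space
        }
        _entries.push_back(std::move(entry));
        return true;
    }
    void snapshot_up_to(size_t n) {  // commit + snapshot the first n entries
        for (size_t i = 0; i < n && !_entries.empty(); ++i) {
            _entries.pop_front();
        }
    }
    size_t size() const { return _entries.size(); }
private:
    size_t _limit;
    std::deque<std::string> _entries;
};
```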
If scylla_raid_setup script called without --raiddev argument
then try to use any of /dev/md[0-9] devices instead of only
one /dev/md0. Do it in this way because on Ubuntu 20.04
/dev/md0 used by OS already.
Closes #7628
gcc fails to compile current master like this
In file included from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/service.hh:188:21: error: declaration of ‘const auth::resource& auth::command_desc::resource’ changes meaning of ‘resource’ [-fpermissive]
188 | const resource& resource; ///< Resource impacted by this command.
| ^~~~~~~~
In file included from ./auth/authenticator.hh:57,
from ./auth/service.hh:33,
from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/resource.hh:98:7: note: ‘resource’ declared here as ‘class auth::resource’
98 | class resource final {
| ^~~~~~~~
clang doesn't fail
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201118155905.14447-1-xemul@scylladb.com>
If a list of target endpoints for sending view updates contains
duplicates, it results in benign (but annoying) broken promise
errors happening due to duplicated write response handlers being
instantiated for a single endpoint.
In order to avoid such errors, target remote endpoints are deduplicated
from the list of pending endpoints.
A similar issue (#5459) solved the case for duplicated local endpoints,
but that didn't solve the general case.
Fixes #7572
Closes #7641
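The deduplication idea can be sketched as an order-preserving, first-occurrence-wins pass over the target list; `dedupe_endpoints` and the use of strings for endpoints are illustrative:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: remove duplicate target endpoints before
// instantiating write response handlers, so at most one handler exists
// per endpoint. The first occurrence keeps its position.
std::vector<std::string> dedupe_endpoints(const std::vector<std::string>& targets) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    for (const auto& ep : targets) {
        if (seen.insert(ep).second) {  // insert() reports whether ep was new
            result.push_back(ep);
        }
    }
    return result;
}
```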
This PR allows changing the hinted_handoff_enabled option in runtime, either by modifying and reloading YAML configuration, or through HTTP API.
This PR also introduces an important change in semantics of hinted_handoff_enabled:
- Previously, hinted_handoff_enabled controlled whether _both writing and sending_ hints is allowed at all, or to particular DCs,
- Now, hinted_handoff_enabled only controls whether _writing hints_ is enabled. Sending hints from disk is now always enabled.
Fixes: #5634
Tests:
- unit(dev) for each commit of the PR
- unit(debug) for the last commit of the PR
Closes #6916
* github.com:scylladb/scylla:
api: allow changing hinted handoff configuration
storage_proxy: fix wrong return type in swagger
hints_manager: implement change_host_filter
storage_proxy: always create hints manager
config: plug in hints::host_filter object into configuration
db/hints: introduce host_filter
hints/resource_manager: allow registering managers after start
hints: introduce db::hints::directory_initializer
directories.cc: prepare for use outside main.cc
Fixes #7064
Iff broadcast address is set to ipv6 from main (meaning prefer
ipv6), determine the "public" ipv6 address (which should be
the same, but might not be), via aws metadata query.
Closes #7633
available_memory is used to seed many caches and controllers. Usually
it's detected from the environment, but unit tests configure it
on their own with fake values. If they forget, then the undefined
behavior sanitizer will kick in in random places (see 8aa842614a
("test: gossip_test: configure database memory allocation correctly")
for an example).
Prevent this early by asserting that available_memory is nonzero.
Closes #7612
std::iterator is deprecated since C++17 so define all the required iterator_traits directly and stop using std::iterator at all.
More context: https://www.fluentcpp.com/2018/05/08/std-iterator-deprecated
Tests: unit(dev)
Closes #7635
* github.com:scylladb/scylla:
log_heap: Remove std::iterator from hist_iterator
types: Remove std::iterator from tuple_deserializing_iterator
types: Remove std::iterator from listlike_partial_deserializing_iterator
sstables: remove std::iterator from const_iterator
token_metadata: Remove std::iterator from tokens_iterator
size_estimates_virtual_reader: Remove std::iterator
token_metadata: Remove std::iterator from tokens_iterator_impl
counters: Remove std::iterator from iterators
compound_compat: Remove std::iterator from iterators
compound: Remove std::iterator from iterator
clustering_interval_set: Remove std::iterator from position_range_iterator
cdc: Remove std::iterator from collection_iterator
cartesian_product: Remove std::iterator from iterator
bytes_ostream: Remove std::iterator from fragment_iterator
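The mechanical shape of each of these commits can be shown with a toy iterator: drop the deprecated `std::iterator` base class and declare the five member types that `std::iterator_traits` expects directly:

```cpp
#include <cstddef>
#include <iterator>

// Sketch: instead of deriving from std::iterator<std::forward_iterator_tag, int>,
// spell out the five member types iterator_traits needs.
class counting_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = const int&;

    explicit counting_iterator(int v) : _v(v) {}
    reference operator*() const { return _v; }
    counting_iterator& operator++() { ++_v; return *this; }
    counting_iterator operator++(int) { auto tmp = *this; ++_v; return tmp; }
    bool operator==(const counting_iterator& o) const { return _v == o._v; }
    bool operator!=(const counting_iterator& o) const { return _v != o._v; }
private:
    int _v;
};
```

Because the member types are present, standard algorithms that consult `std::iterator_traits` (such as `std::distance`) keep working unchanged.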
We saw this intermittent failure in testCalculateEndpoints:
```
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
```
It turns out that there are no endpoints associated with the dc passed
to has_sufficient_replicas in the `all_endpoints` map.
Handle this case by returning true.
The dc is still required to appear in `dc_replicas`,
so if it's not found there, fail the test gracefully.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
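The hardened lookup can be sketched by using `find()` instead of `at()` and treating a DC without endpoints as trivially satisfied; the signature below is simplified, not the test's exact code:

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>

// Sketch: look the DC up with find() instead of at(), and treat a DC with
// no endpoints as satisfied, so a missing key no longer throws
// std::out_of_range.
bool has_sufficient_replicas(const std::string& dc,
                             const std::map<std::string, std::set<std::string>>& dc_replicas,
                             const std::map<std::string, std::set<std::string>>& all_endpoints,
                             size_t rf) {
    auto eps = all_endpoints.find(dc);
    if (eps == all_endpoints.end() || eps->second.empty()) {
        return true;  // no endpoints in this DC: nothing more to place
    }
    auto reps = dc_replicas.find(dc);
    size_t replicas = reps == dc_replicas.end() ? 0 : reps->second.size();
    return replicas >= std::min(rf, eps->second.size());
}
```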
Currently stop_snitch is not called if the test fails on exception.
This causes a failure in create_snitch where snitch_inst fails to start
since it wasn't stopped earlier.
For example:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
Backtrace:
0x0000000002825e94
0x000000000282ffa9
0x00007fd065f971df
/lib64/libc.so.6+0x000000000003dbc4
/lib64/libc.so.6+0x00000000000268a3
/lib64/libc.so.6+0x0000000000026788
/lib64/libc.so.6+0x0000000000035fc5
0x0000000000b484cf
0x0000000002a7c69f
0x0000000002a7c62f
0x0000000000b47b9e
0x0000000002595da2
0x0000000002595913
0x0000000002a83a31
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.
This patch replaces all occurrences of disable_failure_guard with
scoped_critical_alloc_section.
Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
As requested in #7057, allow certain alterations of system_auth tables. Potentially destructive alterations are still rejected.
Tests: unit (dev)
Closes #7606
* github.com:scylladb/scylla:
auth: Permit ALTER options on system_auth tables
auth: Add command_desc
auth: Add tests for resource protections
And since there is now no danger of them filling the logs, the log-level
is promoted to info, so users can see the diagnostics messages by
default.
The rate-limit chosen is 1/30s.
Refs: #7398
Tests: manual
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201117091253.238739-1-bdenes@scylladb.com>
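A minimal sketch of a 1-per-30s rate limiter of the kind described (not Seastar's actual rate-limited logging API):

```cpp
#include <chrono>

// Sketch: a diagnostics message is emitted only if the previous one is
// at least `interval` old, so repeated failures cannot flood the log.
class rate_limiter {
public:
    explicit rate_limiter(std::chrono::seconds interval) : _interval(interval) {}
    bool should_log(std::chrono::steady_clock::time_point now) {
        if (_has_last && now - _last < _interval) {
            return false;  // suppressed: too soon after the previous message
        }
        _has_last = true;
        _last = now;
        return true;
    }
private:
    std::chrono::seconds _interval;
    std::chrono::steady_clock::time_point _last{};
    bool _has_last = false;
};
```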
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled, not a list of string pairs. Such a return type is
expected by the JMX.
Implements a function which is responsible for changing hints manager
configuration while it is running.
It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.
The function is blocking and waits until all relevant ep managers are
started or stopped.
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has the type db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
Adds a db::hints::host_filter structure, which determines if generating
hints towards a given target is currently allowed. It supports
serialization to and deserialization from the hinted_handoff_enabled
configuration/CLI option.
This patch only introduces this structure, but does not make other code
use it. It will be plugged into the configuration architecture in the
following commits.
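A rough sketch of such a filter type follows; the accepted option spellings ("true", "false", a comma-separated DC list) are an assumption here, not the exact syntax of `hinted_handoff_enabled`:

```cpp
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical host_filter-like type: the option can be a boolean
// ("true"/"false") or an allow-list of DC names. The exact spellings
// accepted by Scylla's option are an assumption in this sketch.
struct host_filter {
    bool enabled_for_all = false;
    std::unordered_set<std::string> allowed_dcs;

    bool can_hint_for(const std::string& dc) const {
        return enabled_for_all || allowed_dcs.count(dc) > 0;
    }
};

host_filter parse_host_filter(const std::string& opt) {
    host_filter f;
    if (opt == "true") {
        f.enabled_for_all = true;
    } else if (opt != "false") {
        std::istringstream in(opt);
        std::string dc;
        while (std::getline(in, dc, ',')) {  // split the DC allow-list
            f.allowed_dcs.insert(dc);
        }
    }
    return f;
}
```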
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.
This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
Currently, the `directories` class is used exclusively during
initialization, in the main() function. This commit refactors this class
so that it is possible to use it to initialize directories much later
after startup.
The intent of this change is to make it possible for hints manager to
create directories for hints lazily. Currently, when Scylla is booted
with hinted handoff disabled, the `hints_directory` config parameter is
ignored and directories for hints are neither created nor verified.
Because we would like to preserve this behavior and introduce
possibility to switch hinted handoff on in runtime, the hints
directories will have to be created lazily the first time hinted handoff
is enabled.
* seastar 043ecec7...c861dbfb (3):
> Merge "memory: allow configuring when to dump memory diagnostics on allocation failures" from Botond
> perftune.py: support kvm-clock on tune-clock
> execution_stage: inheriting_concrete_execution_stage: add get_stats()
These alterations cannot break the database irreparably, so allow
them.
Expand command_desc as required.
Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.
Fixes #7057
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When a node bootstraps or upgrades from a pre-CDC version, it creates a
new CDC generation, writes it to a distributed table
(system_distributed.cdc_generation_descriptions), and starts gossiping
its timestamp. When other nodes see the timestamp being gossiped, they
retrieve the generation from the table.
The bootstrapping/upgrading node therefore assumes that the generation
is made durable and other nodes will be able to retrieve it from the
table. This assumption could be invalidated if periodic commitlog mode
was used: replicas would acknowledge the write and then immediately
crash, losing the write if they were unlucky (i.e. commitlog wasn't
synced to disk before the write was acknowledged).
This commit enforces all writes to the generations table to be
synced to commitlog immediately. It does not matter for performance as
these writes are very rare.
Fixes https://github.com/scylladb/scylla/issues/7610.
Closes #7619
An entry can be snapshotted before the outgoing message is sent, so the
message has to hold on to it to avoid use-after-free.
Message-Id: <20201116113323.GA1024423@scylladb.com>
Materialized view updates participate in a retirement program,
which makes sure that they are immediately taken down once their
target node is down, without having to wait for timeout (since
views are a background operation and it's wasteful to wait in the
background for minutes). However, this mechanism has very delicate
lifetime issues, and it already caused problems more than once,
most recently in #5459.
In order to make another bug in this area less likely, the two
implementations of the mechanism, in on_down() and drain_on_shutdown(),
are unified.
Possibly refs #7572
Closes #7624
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).
Note that _streaming_flush_phaser and _streaming_flush_gate are
no longer used to synchronize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.
Closes #7454
The DEBIAN_FRONTEND environment variable was added just to prevent a dialog
from opening when running 'apt-get install mdadm'; no other program depends on it.
So we can move it inside apt_install()/apt_uninstall() and drop scylla_env,
since we don't have any other environment variables.
To pass the variable, an env argument was added to run()/out().
"
This is a follow-up on 052a8d036d
"Avoid stalls in token_metadata and replication strategy"
The added mutate_token_metadata helper combines:
- with_token_metadata_lock
- get_mutable_token_metadata_ptr
- replicate_to_all_cores
Test: unit(dev)
"
* tag 'mutate_token_metadata-v1' of github.com:bhalevy/scylla:
storage_service: fixup indentation
storage_service: mutate_token_metadata: do replicate_to_all_cores
storage_service: add mutate_token_metadata helper
Replicate the mutated token_metadata to all cores on success.
This moves replication out of update_pending_ranges(mutable_token_metadata_ptr, sstring),
so add explicit call to replicate_to_all_cores where it is called outside
of mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace a repeating pattern of:
with_token_metadata_lock([] {
return get_mutable_token_metadata_ptr([] (mutable_token_metadata_ptr tmptr) {
// mutate token_metadata via tmptr
});
});
With a call to mutate_token_metadata that does both
and calls the function with the mutable_token_metadata_ptr.
A following patch will also move the replication to all
cores to mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
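A synchronous toy model of the helper's shape; the real code uses Seastar futures and per-shard replication, so everything below is illustrative only:

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Sketch of mutate_token_metadata: take the lock, hand a mutable copy to
// the mutator, then "replicate" the result to all cores (modeled here as
// per-core slots).
struct token_metadata { int version = 0; };

class storage_service {
public:
    explicit storage_service(int cores) : _per_core(cores) {}
    void mutate_token_metadata(const std::function<void(token_metadata&)>& mutate) {
        std::lock_guard<std::mutex> guard(_lock);   // with_token_metadata_lock
        token_metadata tm = _shared;                 // get_mutable_token_metadata_ptr
        mutate(tm);                                  // mutate via the copy
        _shared = tm;
        for (auto& copy : _per_core) {               // replicate_to_all_cores
            copy = _shared;
        }
    }
    int version_on(int core) const { return _per_core[core].version; }
private:
    std::mutex _lock;
    token_metadata _shared;
    std::vector<token_metadata> _per_core;
};
```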
The "dist" target fails as follows:
$ ./tools/toolchain/dbuild ninja dist
ninja: error: 'build/dev/scylla-unified-package-..tar.gz', needed by 'dist-unified-tar', missing and no known rule to make it
Fix two issues:
- Fix Python variable references to "scylla_version" and
"scylla_release", broken by commit bec0c15ee9 ("configure.py: Add
version to unified tarball filename"). The breakage went unnoticed
because ninja default target does not call into dist...
- Remove dependencies to build/<mode>/scylla-unified-package.tar.gz. The
file is now in build/<mode>/dist/tar/ directory and contains version
and release in the filename.
Message-Id: <20201113110706.150533-1-penberg@scylladb.com>
To test handling of connectivity issues and recovery add support for
disconnecting servers.
This is not full partitioning yet as it doesn't allow connectivity
across the disconnected servers (having multiple active partitions).
* https://github.com/alecco/scylla/pull/new/raft-ale-partition-simple-v3:
raft: replication test: connectivity partitioning support
raft: replication test: block rpc calls to disconnected servers
raft: replication test: add is_disconnected helper
raft: replication test: rename global variable
raft: replication test: relocate global connection state map
We currently keep a copy of scylla-package.tar.gz in "build/<mode>" for
compatibility. However, we've long since switched our CI system over to
the new location, so let's remove the duplicate and use the one from
"build/<mode>/dist/tar" instead.
Message-Id: <20201113075146.67265-1-penberg@scylladb.com>
Add to the DynamoDB compatibility document, docs/alternator/compatibility.md,
a mention that Alternator streams are still an experimental feature, and
how to turn it on (at this point CDC is no longer an experimental feature,
but Alternator Streams are).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112184436.940497-1-nyh@scylladb.com>
Drop the adjective "experimental" used to describe Alternator in
docs/alternator/getting-started.md.
In Scylla, the word "experimental" carries a specific meaning (no support
for upgrades, not enough QA, not ready for general use) and Alternator is
no longer experimental in that sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112185249.941484-1-nyh@scylladb.com>
This adds some test cases for ALTER KEYSPACE:
- ALTER KEYSPACE happy path
- ALTER KEYSPACE with invalid options
- ALTER KEYSPACE for non-existing keyspace
- CREATE and ALTER KEYSPACE using NetworkTopologyStrategy with
non-existing data center in configuration, which triggers a bug in
Scylla:
https://github.com/scylladb/scylla/issues/7595
Message-Id: <20201112073110.39475-1-penberg@scylladb.com>
Introduce partition update command consisting of nodes still seeing
each other. Nodes not included are disconnected from everything else.
If the previous leader is not part of the new partition, the first node
specified in the partition will become leader.
For other nodes to accept a new leader it has to have a committed log.
For example, if the desired leader is being re-connected and it missed
entries other nodes saw it will not win the election. Example A B C:
partition{A,C},entries{2},partition{B,C}
In this case node C won't accept B as a new leader as it's missing 2
entries.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In main.cc, we spawn a future which starts the hints manager, but we
don't wait for it to complete. This can have the following consequences:
- The hints manager does some asynchronous operations during startup,
so it can take some time to start. If it is started after we start
handling requests, and we admit some requests which would result in
hints being generated, those hints will be dropped instead because we
check if hints manager is started before writing them.
- Initialization of hints manager may fail, and Scylla won't be stopped
because of it (e.g. we don't have permissions to create hints
directories). The consequence of this is that hints manager won't be
started, and hints will be dropped instead of being written. This may
affect both regular hints manager, and the view hints manager.
This commit causes us to wait until hints manager start and see if there
were any errors during initialization.
Fixes #7598
Closes #7599
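The difference between the fire-and-forget startup and the awaited one can be sketched with plain std::future (hypothetical names; Scylla uses Seastar futures, not std::future):

```cpp
#include <cassert>
#include <future>
#include <stdexcept>

// Hypothetical stand-in for the hints manager: startup may fail,
// e.g. for lack of permission to create the hints directories.
std::future<void> start_hints_manager(bool can_create_dirs) {
    return std::async(std::launch::async, [=] {
        if (!can_create_dirs) {
            throw std::runtime_error("cannot create hints directory");
        }
        // ... asynchronous startup work ...
    });
}

// Before the fix, the future was discarded, so a startup failure went
// unnoticed and hints were silently dropped. After the fix we wait for
// startup and surface any initialization error before serving requests.
bool start_server(bool can_create_dirs) {
    auto started = start_hints_manager(can_create_dirs);
    try {
        started.get(); // wait; rethrows initialization errors
        return true;   // safe to admit requests that may generate hints
    } catch (const std::exception&) {
        return false;  // fail startup instead of dropping hints later
    }
}
```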
CDC is ready to be a non-experimental feature so remove the experimental flag for it.
Also, guard Alternator Streams with their own experimental flag. Previously, they were using CDC experimental flag as they depend on CDC.
Tests: unit(dev)
Closes #7539
* github.com:scylladb/scylla:
alternator: guard streams with an experimental flag
Mark CDC as GA
cdc: Make it possible for CDC generation creation to fail
Add new alternator-streams experimental flag for
alternator streams control.
CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The following patch enables CDC by default and this means CDC has to work
with all the clusters now.
There is a problematic case when existing cluster with no CDC support
is stopped, all the binaries are updated to newer version with
CDC enabled by default. In such case, nodes know that they are already
members of the cluster but they can't find any CDC generation so they
will try to create one. This creation may fail due to lack of QUORUM
for the write.
Before this patch such situation would lead to node failing to start.
After the change, the node will start but CDC generation will be
missing. This will mean CDC won't be able to work on such cluster before
nodetool checkAndRepairCdcStreams is run to fix the CDC generation.
We still fail to bootstrap if the creation of CDC generation fails.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When a missing base column happens to be named `idx_token`,
an additional helper message is printed in logs.
This additional message does not need to have `error` severity,
since the previous, generic message is already marked as `error`.
This patch simply makes it easier to write tests, because in case
this error is expected, only one message needs to be explicitly
ignored instead of two.
Closes #7597
This miniseries adds metrics which can help the users detect potential overloads:
* due to having too many in-flight hints
* due to exceeding the capacity of the read admission queue, on replica side
Closes #7584
* github.com:scylladb/scylla:
reader_concurrency_semaphore: add metrics for shed reads
storage_proxy: add metrics for too many in-flight hints failures
If interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible because only the producer is able
to skip to the next partition if the current one is filtered out, so scylla
crashes when that happens with a bad function call in queue_reader.
This is a regression which started here: 55a8b6e3c9
To fix this problem, let's make sure that partition filtering will only
happen on the producer side.
Fixes #7590.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
"
This series is a rebased version of 3 patchsets that were sent
separately before:
1. [PATCH v4 00/17] Cleanup storage_service::update_pending_ranges et al.
This patchset cleans up service/storage_service's use of
update_pending_ranges and replicate_to_all_cores.
It also moves some functionality from gossiping_property_file_snitch::reload_configuration
into a new method - storage_service::update_topology.
This prepares storage_service for using a shared ptr to token_metadata,
updating a copy out of line under a semaphore that serializes writers,
and eventually replicating the updated copy to all shards and releasing
the lock. This is a follow up to #7044.
2. [PATCH v8 00/20] token_metadata versioned shared ptr
Rather than keeping references on token_metadata use a shared_token_metadata
containing a lw_shared_ptr<token_metadata> (a.k.a token_metadata_ptr)
to keep track of the token_metadata.
Get token_metadata_ptr for a read-only snapshot of the token_metadata
or clone one for a mutable snapshot that is later used to safely update
the base versioned_shared_object.
token_metadata_ptr is used to modify token_metadata out of line, possibly with
multiple calls, that could be preeempted in-between so that readers can keep a consistent
snapshot of it while writers prepare an updated version.
Introduce a token_metadata_lock used to serialize mutators of token_metadata_ptr.
It's taken by the storage_service before cloning token_metadata_ptr and held
until the updated copy is replicated on all shards.
In addition, this series introduces the token_metadata::clone_async() method
to copy the token_metadata class using an asynchronous function with
continuations to avoid reactor stalls as seen in #7220.
Fixes #7044
3. [PATCH v3 00/17] Avoid stalls in token_metadata and replication strategy
This series uses the shared_token_metadata infrastructure.
First patches in the series deal with cloning token_metadata
using continuations to allow preemption while cloning (see #7220).
Then, the rest of the series makes sure to always run
`update_pending_ranges` and `calculate_pending_ranges_for_*` in a thread,
it then adds a `can_yield` parameter to the token_metadata and abstract_replication_strategy
`get_pending_ranges` and friends, and finally it adds `maybe_yield` calls
in potentially long loops.
Fixes #7313
Fixes #7220
Test: unit (dev)
Dtest: gating(dev)
"
* tag 'replication_strategy_can_yield-v4' of github.com:bhalevy/scylla: (54 commits)
token_metadata_impl: set_pending_ranges: add can_yield_param
abstract_replication_strategy: get rid of get_ranges_in_thread
repair: call get_ranges_in_thread where possible
abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
abstract_replication_strategy: define can_yield bool_class
token_metadata_impl: calculate_pending_ranges_for_* reindent
token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
token_metadata_impl: calculate_pending_ranges_for_* call in thread
token_metadata: update_pending_ranges: create seastar thread
abstract_replication_strategy: add get_address_ranges method for specific endpoint
token_metadata_impl: clone_after_all_left: sort tokens only once
token_metadata: futurize clone_after_all_left
token_metadata: futurize clone_only_token_map
token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
repair: replace_with_repair: use token_metadata::clone_async
storage_service: reindent token_metadata blocks
token_metadata: add clone_async
abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
boot_strapper: get_*_tokens: use token_metadata_ptr
...
Add a test that better clarifies what StartingSequenceNumber returned by
DescribeStream really guarantees (this question was raised in a review
of a different patch). The main thing we can guarantee is that reading a
shard from that position returns all the information in that shard -
similar to TRIM_HORIZON. This test verifies this, and it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112081250.862119-1-nyh@scylladb.com>
When the admission queue capacity reaches its limits, excessive
reads are shed in order to avoid overload. Each such operation
now bumps the metrics, which can help the user judge if a replica
is overloaded.
This change adds metrics for counting request message types
listed in the CQL v.4 spec under section 4.1
(https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec).
To organize things properly, we introduce a new cql_server::transport_stats
object type for aggregating the message and server statistics.
Fixes #4888
Closes #7574
Rewrite in a more readable way that will later allow us to split the WHERE expression in two: a storage-reading part and a post-read filtering part.
Tests: unit (dev,debug)
Closes #7591
* github.com:scylladb/scylla:
cql3: Rewrite need_filtering() from scratch
cql3: Store index info in statement_restrictions
The name of the utility function test_object_name() is confusing - by
starting with the word "test", pytest can think (if it's imported to the
top-level namespace) that it is a test... So this patch gives it a better
name - unique_name().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201111140638.809189-1-nyh@scylladb.com>
This patch adds a new document, docs/alternator/compatibility.md,
which focuses on what users switching from DynamoDB to Alternator
need to know about where Alternator differs from DynamoDB and which
features are missing.
The compatibility information in the old alternator.md is not deleted
yet. It probably should be.
Fixes #7556
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110180242.716295-1-nyh@scylladb.com>
* seastar a62a80ba1d...043ecec732 (8):
> semaphore: make_expiry_handler: explicitly use this lambda capture
> configure: add --{enable,disable}-debug-shared-ptr option
> cmake: add SEASTAR_DEBUG_SHARED_PTR also in dev mode
> tls_test: Update the certificates to use sha256
> logger: allow applying a rate-limit to log messages
> Merge "Handle CPUs not attached to any NUMA nodes" from Pavel E
> memory: fix malloc_usable_size() during early initialization
> Merge "make semaphore related functions noexcept" from Benny
Makes it easier to understand, in preparation for separating the WHERE
expression into filtering and storage-reading parts.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
To rewrite need_filtering() in a more readable way, we need to store
info on found indexes in statement_restrictions data members.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The functions can be simplified as they are all now being called
from a seastar thread.
Make them sequential, returning void, and yielding if necessary.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some of the callers of get_address_ranges are interested in the ranges
of a specific endpoint.
Rather than building a map for all endpoints and then traversing
it looking for this specific endpoint, build a multimap of token ranges
relating only to the specified endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the sorted tokens are copied needlessly on this path
by `clone_only_token_map` and then recalculated after calling
remove_endpoint for each leaving endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call the futurized clone_only_token_map and
remove the _leaving_endpoints from the cloned token_metadata_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Does part of clone_async() using continuations to prevent stalls.
Rename the synchronous variant to clone_only_token_map_sync,
which is going to be deprecated once all its users are futurized.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
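The idea of copying a large token collection in chunks, yielding between chunks to avoid reactor stalls, can be sketched like this (illustrative only: a plain callback stands in for Seastar's maybe_yield/continuations):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Copy src into a new vector, invoking maybe_yield() after every
// `chunk` elements so a long copy cannot monopolize the reactor.
std::vector<long> clone_tokens(const std::vector<long>& src,
                               std::size_t chunk,
                               const std::function<void()>& maybe_yield) {
    std::vector<long> dst;
    dst.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst.push_back(src[i]);
        if ((i + 1) % chunk == 0) {
            maybe_yield(); // preemption point between chunks
        }
    }
    return dst;
}
```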
Replacing old code using lw_shared_ptr<token_metadata> with the "modern"
mutable_token_metadata_ptr alias.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clone the input token_metadata asynchronously using
clone_async() before modifying it using update_normal_tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many code blocks using with_token_metadata_lock
and get_mutable_token_metadata_ptr now need re-indenting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Only replace_with_repair needs to clone the token_metadata
and update the local copy, so we can safely pass a read-only
snapshot of the token_metadata rather than copying it in all cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Perform replication in 2 phases.
First phase just clones the mutable_token_metadata_ptr on all shards.
Second phase applies the cloned copies onto each local_ss._shared_token_metadata.
That phase should never fail.
To add suspenders over the belt, in the impossible case we do get an
exception, it is logged and we abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
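The two-phase replication can be sketched in a self-contained form (shards are modeled as vector slots; names are illustrative, not Scylla's actual API). The point is that only phase 1 can fail, so a failure never leaves some shards updated and others not:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct token_metadata { int version = 0; };
using token_metadata_ptr = std::shared_ptr<const token_metadata>;

// Phase 1 allocates a clone per shard and may throw (e.g. bad_alloc),
// leaving the currently-published copies untouched. Phase 2 only swaps
// pointers, so it cannot fail half-way.
void replicate_to_all_shards(std::vector<token_metadata_ptr>& shards,
                             const token_metadata& updated) {
    // Phase 1: clone for every shard up front.
    std::vector<token_metadata_ptr> clones;
    clones.reserve(shards.size());
    for (std::size_t i = 0; i < shards.size(); ++i) {
        clones.push_back(std::make_shared<token_metadata>(updated));
    }
    // Phase 2: publish; pointer assignments only, never throws.
    for (std::size_t i = 0; i < shards.size(); ++i) {
        shards[i] = std::move(clones[i]);
    }
}
```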
Clone _token_metadata for updating into _updated_token_metadata
and use it to update the local token_metadata on all shards via
do_update_pending_ranges().
Adjust get_token_metadata to return either the updated_token_metadata,
if available, or the base token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using `serialized_action`, grab a lock before mutating
_token_metadata and hold it until it's replicated to all shards.
A following patch will use a mutable token_metadata_ptr
that is updated out of line under the lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the replication to other shards happens later in `prepare_to_join`
that is called in `init_server`.
We should isolate the changes made by init_server and update them first
to all shards so that we can serialize them easily using a lock
and a mutable_token_metadata_ptr, otherwise the lock and the mutable_token_metadata_ptr
will have to be handed over (from this call path) to `prepare_to_join`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation for changing network_topology_strategy to
accept a const shared_token_metadata& rather than a token_metadata&.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than accessing abstract_replication_strategy::_token_metadata directly.
In preparation for changing it to a shared_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that `replicate_tm_only` doesn't throw, we handle all errors
in `replicate_tm_only().handle_exception`.
We can't just proceed with business as usual if we failed to replicate
token_metadata on all shards and continue working with inconsistent
copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And with that mark also do_replicate_to_all_cores as noexcept.
The motivation to do so is to catch all errors in replicate_tm_only
and calling on_internal_error in the `handle_exception` continuation
in do_replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling invalidate_cached_rings and update_topology
on all shards do that only on shard 0 and then replicate
to all other shards using replicate_to_all_cores as we do
in all other places that modify token_metadata.
Do this in preparation to using a token_metadata_ptr
with which updating of token_metadata is done on a cloned
copy (serialized under a lock) that becomes visible only when
applied with replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the functionality from gossiping_property_file_snitch::reload_configuration
to the storage_service class.
With that we can make get_mutable_token_metadata private.
TODO: update token_metadata on shard 0 and then
replicate_to_all_cores rather than updating on all shards
in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
keyspace_changed just calls update_pending_ranges (ignoring any
errors returned from it), so invoke it on shard 0. With that,
update_pending_ranges() is always called on shard 0
and it doesn't need to use `invoke_on` shard 0 by itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to assert in only 2 places:
do_update_pending_ranges, that updates token metadata,
and replicate_tm_only, that copies the token metadata
to all other shards.
Currently we throw errors if this is violated
but it should never happen and it's not really recoverable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently update_pending_ranges involves 2 serialized actions:
do_update_pending_ranges, and then replicate_to_all_cores.
These can be combined by calling do_replicate_to_all_cores
directly from do_update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It was introduced in 74b4035611
as part of the fix for #3203.
However, the reactor stalls have nothing to do with gossip
waiting for update_pending_ranges - they are related to it being
synchronous and quadratic in the number of tokens
(e.g. get_address_ranges calling calculate_natural_endpoints
for every token then simple_strategy::calculate_natural_endpoints
calling get_endpoint for every token)
There is nothing special in handle_state_leaving that requires
moving update_pending_ranges to the background, we call
update_pending_ranges in many other places and wait for it
so if gossip loop waiting on it was a real problem, then it'd
be evident in many other places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently _update_pending_ranges_action is called only on shard 0
and only later update_pending_ranges() updates shard 0 again and replicates
the result to all shards.
There is no need to wait between the two, and call _update_pending_ranges_action
again, so just call update_pending_ranges() in the first place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
so that the updated host_id (on shard 0) will get replicated to all shards
via update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's already done by each handle_state_* function
either by directly calling replicate_to_all_cores or indirectly, via
update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the updates to token_metadata are immediately visible
on shard 0, but not to other shards until replicate_to_all_cores
syncs them.
To prepare for converting to using shared token_metadata.
In the new world the updated token_metadata is not visible
until committed to the shared_token_metadata, so
commit it here and replicate to all other shards.
It is not clear this isn't needed presently too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.
Additionally, make sure the producer is aborted too on abort. In theory
this is not needed, as they are the one initiating the abort, but better
to be safe than sorry.
Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
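The EOS-on-abort hazard can be modeled with a tiny self-contained class (illustrative names only; in Scylla this concerns the queue reader's end-of-stream flag and fill_buffer()). An aborted stream must not report end-of-stream, or a consumer checking the flag before pulling the next fragment mistakes a truncated stream for a complete one:

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

class stream {
    int _remaining;
    bool _eos = false;
    bool _aborted = false;
public:
    explicit stream(int n) : _remaining(n) {}
    // The fix: abort deliberately does NOT set _eos, so the stream
    // never looks like it ended normally.
    void abort() { _aborted = true; }
    bool is_end_of_stream() const { return _eos; }
    // Returns nullopt and records EOS only on normal completion;
    // throws if the stream was aborted (like hitting the injected
    // exception on the next fill_buffer()).
    std::optional<int> next() {
        if (_aborted) { throw std::runtime_error("stream aborted"); }
        if (_remaining == 0) { _eos = true; return std::nullopt; }
        return _remaining--;
    }
};
```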
"
Today, if Scylla crashes midway through sstable::idempotent-move-sstable
or sstable::create_links, we may end up in an inconsistent state
where it refuses to restart due to the presence of the moved
sstable's component files in both the staging directory and the
main directory.
This series hardens scylla against this scenario by:
1. Improving sstable::create_links to identify the replay condition
and support it.
2. Modifying the algorithm for moving sstables between directories
to never be in a state where we have two valid sstables with the
same generation in both the source and destination directories.
Instead, it uses the temporary TOC file as a marker for rolling
backwards or forwards, and renames it atomically from the
destination directory back to the source directory as a commit
point. Before which it is preparing the sstable in the destination
dir, and after which it starts the process of deleting the sstable
in the source dir.
Fixes #7429
Refs #5714
"
* tag 'idempotent-move-sstable-v3' of github.com:bhalevy/scylla:
sstable: create_links: support for move
sstable_directory: support sstables with both TemporaryTOC and TOC
sstable: create_links: move automatic sstring variables
sstable: create_links: use captured comps
sstable: create_links: capture dir by reference
sstable: create_links: fix indentation
sstable: create_links: no need to roll-back on failure anymore
sstable: create_links: support idempotent replay
sstable: create_links: cleanup style
sstable: create_links: add debug/trace logging
sstable: move_to_new_dir: rm TOC last
sstable: move_to_new_dir: io check remove calls
test: add sstable_move_test
Previously, test/cql-pytest/run was a Python script, while
test/cql-pytest/run-cassandra (to run the tests against Cassandra)
was still a shell script - modeled after test/alternator/run.
This patch rewrites run-cassandra in Python.
A lot of the same code is needed for both run and run-cassandra
tools. test/cql-pytest/run was already written so that this
common code was in separate functions. For example, functions to start a
server in a temporary directory, to check when it finishes booting,
and to clean up at the end. This patch moves this common code to
a new file, "run.py" - and the tools "run" and "run-cassandra" are
very short programs which mostly use functions from run.py (run-cassandra
also has some unique code to run Cassandra, that no other test runner
will need).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110215210.741753-1-nyh@scylladb.com>
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but it doesn't work for perftune.py, since it's not part of
Scylla and does not use the scylla_util module.
We can set PATH in the python thunk instead, which can set PATH for all python scripts.
Fixes #7350
After full cluster shutdown, the node which is being replaced will not have its
STATUS set to NORMAL (bug #6088), so listeners will not update _token_metadata.
The bootstrap procedure of replacing node has a workaround for this
and calls update_normal_tokens() on token metadata on behalf of the
replaced node based on just its TOKENS state obtained in the shadow
round.
It does this only for the replacing_a_node_with_same_ip case, but not
for replacing_a_node_with_diff_ip. As a result, replacing the node
with the same ip after full cluster shutdown fails.
We can always call update_normal_tokens(). If the cluster didn't
crash, token_metadata would get the tokens.
Fixes #4325
Message-Id: <1604675972-9398-1-git-send-email-tgrabiec@scylladb.com>
This patch introduces a new way to do functional testing on Scylla,
similar to Alternator's test/alternator but for the CQL API:
The new tests, in test/cql-pytest, are written in Python (using the pytest
framework), and use the standard Python CQL driver to connect to any CQL
implementation - be it Scylla, Cassandra, Amazon Keyspaces, or whatever.
The use of standard CQL allows the test developer to easily run the same
test against both Scylla and Cassandra, to confirm that the behaviour that
our test expects from Scylla is really the "correct" (meaning Cassandra-
compatible) behavior.
A developer can run Scylla or Cassandra manually, and run "pytest"
to connect to them (see README.md for more instructions). But even more
usefully, this patch also provides two scripts: test/cql-pytest/run and
test/cql-pytest/run-cassandra. These scripts automate the task of running
Scylla or Cassandra (respectively) in a random IP address and temporary
directory, and running the tests against it.
The script test/cql-pytest/run is inspired by the existing test run
scripts of Alternator and Redis, but rewritten in Python in a way that
will make it easy to rewrite - in a future patch - all these other run
scripts to use the same common code to safely run a test server in a
temporary directory.
"run" is extremely quick, taking around two seconds to boot Scylla.
"run-cassandra" is slower, taking 13 seconds to boot Cassandra (maybe
this can be improved in the future, I still don't know how).
The tests themselves take milliseconds.
Although the 'run' script runs a single Scylla node, the developer
can also bring up any size of Scylla or Cassandra cluster manually
and run the tests (with "pytest") against this cluster.
This new test framework differs from the existing alternatives in the
following ways:
dtest: dtest focuses on testing correctness of *distributed* behavior,
involving clusters of multiple nodes and often cluster changes
during the test. In contrast, cql-pytest focuses on testing the
*functionality* of a large number of small CQL features - which
can usually be tested on a single-node cluster.
Additionally, dtest is out-of-tree, while cql-pytest is in-tree,
making it much easier to add or change tests together with code
patches.
Finally, dtest tests are notoriously slow. Hundreds of tests in
the new framework can finish faster than a single dtest.
Slow and out-of-tree tests are difficult to write, and I believe
this explains why no developer loves writing dtests and maintainers
do not insist on having them. I hope cql-pytest can change that.
test/cql: The defining difference between the existing test/cql suite
and the new test/cql-pytest is the new framework is programmatic,
Python code, not a text file with desired output. Tests written as
code allow things like looping and repeating the same test with different
parameters. Also, when a test fails, it makes it easier to understand
why it failed beyond just the fact that the output changed.
Moreover, in some cases, the output changes benignly and cql-pytest
may check just the desired features of the output.
Beyond this, the current version of test/cql cannot run against
Cassandra. test/cql-pytest can.
The primary motivation for this new framework was
https://github.com/scylladb/scylla/issues/7443 - where we had an
esoteric feature (sort order of *partitions* when an index is added),
which can be shown in Cqlsh to have what we think is incorrect behavior,
and yet: 1. We didn't catch this bug because we never wrote a test for it,
possibly because it is too difficult to contribute tests, and 2. We *thought*
that we knew what Cassandra does in this case, but nobody actually tested
it. Yes, we can test it manually with cqlsh, but wouldn't everything be
better if we could just run the same test that we wrote for Scylla against
Cassandra?
So one of the tests we add in this patch confirms issue #7443 in Scylla,
and that our hunch was correct and Cassandra indeed does not have this
problem. I also add a few trivial tests for keyspace create and drop,
as additional simple examples.
Refs #7443.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110110301.672148-1-nyh@scylladb.com>
I noticed that we require filtering for continuous clustering key, which is not necessary. I dropped the requirement and made sure the correct data is read from the storage proxy.
The corresponding dtest PR: https://github.com/scylladb/scylla-dtest/pull/1727
Tests: unit (dev,debug), dtest (next-gating, cql*py)
Closes #7460
* github.com:scylladb/scylla:
cql3: Delete some newlines
cql3: Drop superfluous ALLOW FILTERING
cql3: Drop unneeded filtering for continuous CK
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and these could
be useful when judging if the system is in overloaded state.
This commit changes the build file generation and the package
creation scripts to be product aware. This will change the
relocatable package archives to be named after the product.
This commit deals with two main things:
1. Creating the actual Scylla server relocatable with a product
prefixed name - which is independent of any other change
2. Expect all other packages to create product prefixed archives -
which is dependent upon the actual submodules creating
product prefixed archives.
If the support is not introduced in the submodules first this
will break the package build.
Tests: Scylla full build with the original product and a
different product name.
Closes #7581
Currently debian_files_gen.py mistakenly renames scylla-server.service to
"scylla-server." in a non-standard product name environment such as
scylla-enterprise; it should be fixed to use the correct filename.
Fixes #7423
This patch introduces many changes to the Scylla `CMakeLists.txt`
to enable building Scylla without resorting to pre-building
with a previous configure.py build, i.e. cmake script can now
be used as a standalone solution to build and execute scylla.
Submodules, such as Seastar and Abseil, are also dealt with
by importing their CMake scripts directly via `add_subdirectory`
calls. Other submodules, such as `libdeflate` now have a
custom command to build the library at runtime.
There are still a lot of things that are incomplete, though:
* Missing auxiliary packaging targets
* Unit-tests are not built (First priority to address in the
following patches)
* Compile and link flags are mostly hardcoded to the values
appropriate for the most recent Fedora 33 installation.
System libraries should be found via built-in `Find*` scripts,
compiler and linker flags should be observed and tested by
executing feature tests.
* The current build is aimed to be built by GCC, need to support
Clang since we are moving to it.
* Utility cmake functions should be moved to a separate "cmake"
directory.
The script is updated to use the most recent CMake version available
in Fedora 33, which is 3.18.
Right now this is more of a PoC than a full-fledged solution
but as far as it's not used widely, we are free to evolve it in
a relaxed manner, improving it step by step to achieve feature
parity with `configure.py` solution.
The value in this patch is that now we are able to use any
C++ IDE capable of dealing with CMake solutions and take
advantage of their built-in capabilities, such as:
* Building a code model to efficiently navigate code.
* Find references to symbols.
* Use pretty-printers, beautifiers and other tools conveniently.
* Run scylla and debug it right from the IDE.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201103221619.612294-1-pa.solodovnikov@scylladb.com>
DescribeTable should return a UUID "TableId" in its response.
We already had it for CreateTable, and now this patch adds it to
DescribeTable.
The test for this feature is no longer xfail. Moreover, I improved
the test to not only check that the TableId field is present - it
should also match the documented regular expression (the standard
representation of a UUID).
Refs #5026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201104114234.363046-1-nyh@scylladb.com>
When moving a sstable between directories, we would like to
be able to crash at any point during the algorithm with a
clear way to either roll the operation forwards or backwards.
To achieve that, define sstable::create_links_common that accepts
a `mark_for_removal` flag, implementing the following algorithm:
1. link src.toc to dst.temp_toc.
until removed, the destination sstable is marked for removal.
2. link all src components to dst.
crashing here will leave dst with both temp_toc and toc.
3.
a. if mark_for_removal is unset then just remove dst.temp_toc.
this commits the destination sstable and completes create_links.
b. if mark_for_removal is set then move dst.temp_toc to src.temp_toc.
this will atomically toggle recovery after crash from roll-back
to roll-forward.
here too, crashing at this point will leave src with both
temp_toc and toc.
Adjust the unit test for the revised algorithm.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
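The roll-forward/roll-back protocol above can be sketched in Python (the real code is C++ in sstables/; the component file names and the create_links_common signature here are illustrative):

```python
import os

def create_links_common(src_dir, dst_dir, components, mark_for_removal):
    """Sketch of the crash-safe linking algorithm (illustrative names)."""
    # 1. Link src TOC to dst TemporaryTOC: until that link is removed,
    #    the destination sstable counts as marked for removal.
    os.link(os.path.join(src_dir, "TOC.txt"),
            os.path.join(dst_dir, "TemporaryTOC.txt"))
    # 2. Link all src components to dst. Crashing here leaves dst with
    #    both TemporaryTOC and TOC.
    for comp in components:
        os.link(os.path.join(src_dir, comp), os.path.join(dst_dir, comp))
    if not mark_for_removal:
        # 3a. Commit the destination sstable, completing create_links.
        os.remove(os.path.join(dst_dir, "TemporaryTOC.txt"))
    else:
        # 3b. Atomically toggle recovery after a crash from roll-back to
        #     roll-forward: the source sstable is now marked for removal.
        os.rename(os.path.join(dst_dir, "TemporaryTOC.txt"),
                  os.path.join(src_dir, "TemporaryTOC.txt"))
```

Since every step is a single hard link, unlink, or rename, a crash between any two steps leaves an unambiguous state that the recovery code can classify by which TOC/TemporaryTOC files exist.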
Keep descriptors in a map so they can be searched easily by generation,
and possibly delete the descriptor, if found, in the presence of
a temporary toc component.
A following patch will add support to create_links for moving
sstables between directories. It is based on keeping a TemporaryTOC
file in the destination directory while linking all source components.
If scylla crashes here, the destination sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move backwards.
Then, create_links will atomically move the TemporaryTOC from
the destination back to the source directory, to toggle rolling
back to rolling forward by marking the source sstable for removal.
If scylla crashes here, the source sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move forward.
Add unit test for this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we use `idempotent_link_file` it'll no longer
fail with EEXIST in a replay scenario.
It may fail on ENOENT, and return an exceptional future.
This will be propagated up the stack. Since it may indicate
parallel invocation of move_to_new_dir, which deletes the source
sstable while this thread links it to the same destination,
rolling back by removing the destination links would
be dangerous.
For any other error, the node is going to be isolated
and stop operating.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Handle the case where create_link is replayed after crashing in the middle.
In particular, if we restart when moving sstables from staging to the base dir,
right after create_links completes, and right before deleting the source links,
we end up with seemingly 2 valid sstables, one still in staging and the other
already in the base table directory, both are hard linked to the same inodes.
Make create_links idempotent so it can replay the operation safely if crashed and
restarted at any point of its operation.
Add unit tests for replay after partial create_links that is expected to succeed,
and a test for replay when an sstable exist in the destination that is not
hard-linked to the source sstable; create_links is expected to fail in this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate cleanup on crash, first rename the TOC file to TOC.tmp,
and keep until all other files are removed, finally remove TOC.tmp.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This use-after-move was apparently exposed after switching to clang
in commit eb861e68e9.
The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.
This shows in the node logs as a "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.
Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
Older distributions such as CentOS 7 do not support systemd user mode.
On such distributions nonroot mode does not work, so show a warning message and
skip running systemctl --user.
Fixes #7071
... to config descriptions
We allow setting the transitional auth as one of the options
in scylla.yaml, but don't mention it at all in the field's
description. Let's change that.
Closes #7565
The function is used by raft and fails with ubsan and clang.
The UB is harmless. Let's wait for it to be fixed in boost.
Message-Id: <20201109090353.GZ3722852@scylladb.com>
The retry mechanism didn't work when URLError happened. For example:
urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
Let's catch URLError instead of HTTP since URLError is a base exception
for all exceptions in the urllib module.
Fixes: #7569
Closes #7567
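The gist of the fix is widening the caught exception type. A minimal sketch (the retry loop and curl_with_retry are illustrative, not the actual scylla_util.py code):

```python
import time
import urllib.error
import urllib.request

def curl_with_retry(url, retries=5, sleep=0.1):
    """Illustrative retry helper.

    urllib.error.URLError is the base class of urllib's exceptions --
    HTTPError subclasses it -- so catching URLError also covers
    low-level failures such as 'Network is unreachable'."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise
            time.sleep(sleep)
```

Catching only HTTPError would miss connection-level failures, which are raised as plain URLError.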
If _offset falls beyond compound_type->types().size()
ignore the extra components instead of accessing out of the types
vector range.
FIXME: we should validate the thrift key against the schema
and reject it in the thrift handler layer.
Refs #7568
Test: unit(dev)
DTest: cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201108175738.1006817-1-bhalevy@scylladb.com>
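The bounds check can be illustrated with a small sketch (hypothetical helper; the real check lives in the C++ compound-type code):

```python
def deserialize_key_components(types, components):
    """Pair each serialized key component with its schema type, ignoring
    extra components beyond len(types) instead of indexing out of the
    types vector range (illustrative helper)."""
    # zip() stops at the shorter sequence, so trailing components with
    # no corresponding type are silently ignored.
    return list(zip(types, components))
```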
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.
Fixes #3034
Closes #7533
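A minimal model of the behavioural difference (hypothetical classes; the real code is Scylla's C++ keyspace/database machinery):

```python
class Keyspace:
    def __init__(self, durable_writes):
        # metadata is mutable: ALTER KEYSPACE updates it in place
        self.metadata = {"durable_writes": durable_writes}

class Database:
    def __init__(self):
        self.commitlog = []

    def apply(self, keyspace, mutation):
        # Read durable_writes at mutation-apply time (Cassandra's
        # behaviour, and Scylla's after this fix) so that ALTER
        # KEYSPACE takes effect immediately, without a restart.
        if keyspace.metadata["durable_writes"]:
            self.commitlog.append(mutation)
        # ... the mutation is applied to memtables either way ...
```

Reading the flag once at keyspace construction (the old behaviour) would freeze the value until restart.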
This series provides assorted fixes which are a
pre-requisite for the joint consensus implementation
series which follows.
* scylla-dev/raft-misc:
raft: fix raft_fsm_test flakiness
raft: drop a waiter of snapshoted entry
raft: use correct type for node info in add_server()
raft: overload operator<< for debugging
An index that is waited can be included in an installed snapshot in
which case there is no way to know if the entry was committed or not.
Abort such waiters with an appropriate error.
Overload operator<< for ostream and print relevant state for server, fsm, log,
and typed_uint64 types.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The query processor is present in the global namespace and is
widely accessed with global get(_local)?_query_processor().
There's a long-term task to get rid of this globality and make
services and components reference each other and, as a consequence,
start and stop in a specific order. This set does this for the
query processor.
The remaining users of it are -- alternator, controllers for
client services, schema_tables and sys_dist_ks. All of them
except for the schema_tables are fixed just by passing the
reference on query processor with small patches. The schema
tables accessing qp sit deep inside the paxos code, but can
be "fixed" with the qctx thing until the qctx itself is
de-globalized.
* https://github.com/xemul/scylla/tree/br-rip-global-query-processor:
code: RIP global query processor instance
cql test env: Keep query processor reference on board
system distributed keyspace: Start sharded service earlier
schema_tables: Use qctx to make internal requests
transport: Keep sharded query processor reference on controller
thrift: Keep sharded query processor reference on controller
alternator: Use local query processor reference to get keys
alternator: Keep local query processor reference in server
The only purpose of this change is to compile (git-bisect
safety) and thus prove that the next patch is correct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the view builder cannot read view building progress from an
internal CQL table it produces an error message, but that only confuses
the user and the test suite -- this situation is entirely recoverable,
because the builder simply assumes that there is no progress and the
view building should start from scratch.
Fixes #7527
Closes #7558
repair: Use single writer for all followers
Currently, the repair master creates one writer for each follower to write
rows from the follower to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes #7525
Closes #7528
* github.com:scylladb/scylla:
repair: No more vector for _writer_done and friends
repair: Use single writer for all followers
The default of DBUILD_TOOL=docker requires passwordless access to docker
by the user of dbuild. This is insecure, as any user with unconstrained
access to docker is root equivalent. Therefore, users might prefer to
run docker as root (e.g. by setting DBUILD_TOOL="sudo docker").
However, `$tool -e HOME` exports HOME as seen by $tool.
This breaks dbuild when `$tool` runs docker as another user.
`$tool -e HOME="$HOME"` exports HOME as seen by dbuild, which is
the intended behaviour.
Closes #7555
Instead of invoking `$tool`, as is done everywhere else in dbuild,
kill_it() invoked `docker` explicitly. This was slightly breaking the
script for DBUILD_TOOL other than `docker`.
Closes #7554
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().
The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:
scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.
This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.
The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.
Fixes #7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>
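The essence of the fix, as a Python model (names mirror mp_row_consumer; the logic is heavily simplified):

```python
class MpRowConsumer:
    """Model of the consumer state involved in the bug (illustrative)."""

    def __init__(self):
        self._ready = None          # fragment parsed but not yet pushed
        self._range_tombstones = []

    def set_ready(self, fragment):
        # Mirrors the `!_ready` assertion that fired before the fix.
        assert self._ready is None
        self._ready = fragment

    def on_new_partition(self):
        # Before the fix only _range_tombstones was cleared here, so a
        # fragment left in _ready (e.g. when the buffer filled up right
        # at a partition end) leaked into the next partition and tripped
        # the `!_ready` assertion. The fix clears _ready as well.
        self._range_tombstones.clear()
        self._ready = None
```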
Fixes the ordering of returned rows to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive ones.
Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation.
Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens.
Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`.
Switch token computation in secondary indexes to (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if cluster supports `correct_idx_token_in_secondary_index` feature to make sure that all nodes
will be able to compute new `token_column_computation`. Also old indexes will need to be rebuilt to take advantage of this fix, as new token column computation type is only set for new indexes.
Fix tests according to new token ordering and add one new test to validate this aspect explicitly.
Fixes #7443
Tested manually a scenario when someone created an index on old version of Scylla and then migrated to new Scylla. Old index continued to work properly (but returning in wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order.
Closes #7534
* github.com:scylladb/scylla:
tests: add token ordering test of indexed selects
tests: fix tests according to new token ordering
secondary_index: use new token_column_computation
feature: add correct_idx_token_in_secondary_index
column_computation: add token_column_computation
token_column_computation: rename as legacy
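The ordering problem is easy to demonstrate in plain Python: an int64 token serialized as big-endian bytes compares lexicographically, i.e. unsigned, so negative tokens sort after positive ones:

```python
import struct

def legacy_token_cell(token):
    # bytes-typed column value: compared lexicographically, i.e. unsigned
    return struct.pack(">q", token)

tokens = [-100, -1, 0, 1, 100]

# Sorting by the serialized bytes puts negative tokens (whose two's
# complement encoding starts with 0xFF) after all positive ones.
by_bytes = sorted(tokens, key=legacy_token_cell)
assert by_bytes == [0, 1, 100, -100, -1]

# Sorting by the signed int64 value -- what a long_type-returning
# token_column_computation effectively yields -- gives true token order.
assert sorted(tokens) == [-100, -1, 0, 1, 100]
```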
The shared_from_this lw_shared_ptr must not be accessed
across shards. Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out-of-sync when incremented/decremented in parallel
on other shards with no synchronization.
This was introduced in 289a08072a.
The writer is not needed in the body of this lambda anyways
so it doesn't need to capture it. It is already held
by the continuations until the end of the chain.
Fixes #7535
Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
"
Since we are switching to clang due to raft make it actually compile
with clang.
"
tgrabiec: Dropped the patch "raft: compile raft by default" because
the replication_test still fails in debug mode:
/usr/include/boost/container/deque.hpp:1802:63: runtime error: applying non-zero offset 8 to null pointer
* 'raft-clang-v2' of github.com:scylladb/scylla-dev:
raft: Use different type to create type dependent statement for static assertion
raft: drop use of <ranges> for clang
raft: make test compile with clang
raft: drop -fcoroutines support from configure.py
Now that both repair followers and the repair master use a single writer, we
can get rid of the vector associated with _writer_done and friends.
Fixes #7525
Currently, the repair master creates one writer for each follower to write
rows from the follower to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes #7525
A gcc bug [1] caused objects built by different versions of gcc
not to interoperate. Gcc helpfully warns when it encounters code that
could be affected.
Since we build everything with one version, and as that version is far
newer than the last version generating incorrect code, we can silence
that warning without issue.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728
Closes #7495
Do not run tests which are not built.
For that, pass the test list from configure.py to test.py
via ninja unit_test_list target.
Minor cleanups.
* scylla-dev.git/test.py-list:
test: enable raft tests
test.py: do not run tests which are not built
configure.py: add a ninja command to print unit test list
test.py: handle ninja mode_list failure
configure.py: don't pass modes_list unless it's used
Add new test validating that rows returned from both non-indexed selects
and indexed selects return rows sorted in token order (making sure
that both positive and negative tokens are present to test if signed
comparison order is maintained).
Switches token column computation to (new) token_column_computation,
which fixes#7443, because new token column will be compared using
signed comparisons, not the previous unsigned comparison of CQL bytes
type.
This column computation type is only set if cluster supports
correct_idx_token_in_secondary_index feature to make sure that all nodes
will be able to compute (new) token_column_computation. Also old
indexes will need to be rebuilt to take advantage of this fix, as new
token column computation type is only set for new indexes.
Add new correct_idx_token_in_secondary_index feature, which will be used
to determine if all nodes in the cluster support new
token_column_computation. This column computation will replace
legacy_token_column_computation in secondary indexes, which was
incorrect as this column computation produced values that when compared
with unsigned comparison (CQL type bytes comparison) resulted in
different ordering than token signed comparison. See issue:
https://github.com/scylladb/scylla/issues/7443
Introduce new token_column_computation class which is intended to
replace legacy_token_column_computation. The new column computation
returns token as long_type, which means that it will be ordered
according to signed comparison (not unsigned comparison of bytes), which
is the correct ordering of tokens.
Rename token_column_computation to legacy_token_column_computation, as
it will be replaced with new column_computation. The reason is that this
computation returns bytes, but all tokens in Scylla can now be
represented by int64_t. Moreover, returning bytes causes invalid token
ordering as bytes comparison is done in unsigned way (not signed as
int64_t). See issue:
https://github.com/scylladb/scylla/issues/7443
When computing moving average rates too early after startup, the
rate can be infinite, this is simply because the sample interval
since the system started is too small to generate meaningful results.
Here we check for this situation and keep the rate at 0 if it happens
to signal that there are still no meaningful results.
This incident is unlikely to happen since it can happen only during a
very small time window after restart, so we add a hint to the compiler
to optimize for that in order to have a minimum impact on the normal
usecase.
Fixes #4469
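A sketch of the guard (the min_interval threshold and function shape are illustrative, not the actual metrics code):

```python
import time

def moving_average_rate(count, start_time, now=None, min_interval=1e-3):
    """Report a rate of 0 when the sample interval since startup is too
    small to be meaningful, instead of a huge or infinite value."""
    now = time.monotonic() if now is None else now
    interval = now - start_time
    if interval < min_interval:  # unlikely: only right after startup
        return 0.0
    return count / interval
```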
The memory configuration for the database object was left at zero.
This can cause the following chain of failures:
- the test is a little slow due to the machine being overloaded,
and debug mode
- this causes the memtable flush_controller timer to fire before
the test completes
- the backlog computation callback is called
- this calculates the backlog as dirty_memory / total_memory; this
is 0.0/0.0, which resolves to NaN
- eventually this gets converted to an integer
- UBSAN doesn't like the conversion from NaN to integer, and complains
Fix by initializing dbcfg.available_memory.
Test: gossip_test(debug), 1000 repetitions with concurrency 6
Closes #7544
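A Python model of the failure chain (Python raises on 0.0/0.0 rather than yielding NaN, so the NaN that C++ floating-point division produces is written explicitly here):

```python
import math

def backlog(dirty_memory, total_memory):
    """With available_memory left at 0, the backlog is 0.0/0.0, which is
    NaN under C++ floating-point semantics; converting that NaN to an
    integer is the undefined behaviour UBSAN flags. Initializing
    total_memory (dbcfg.available_memory) avoids the whole chain."""
    if total_memory == 0.0:
        return math.nan   # what C++ 0.0/0.0 evaluates to
    return dirty_memory / total_memory
```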
Fixes #7325
When building with clang on fedora32, calling the string_view constructor
of bignum generates broken IDs (i.e. parsing breaks). Creating a temp
std::string fixes it.
Closes #7542
Since 11a8912093, get_gossip_status
returns a std::string_view rather than a sstring.
As seen in dtest we may print garbage to the log
if we print the string_view after preemption (calling
_gossiper.reset_endpoint_state_map().get())
Test: update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_two_nodes_in_parallel_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201103132720.559168-1-bhalevy@scylladb.com>
"
Introduce a gentle (yielding) implementation of reserve for chunked
vector and use it when reserving the backing storage vector for large
bitset. Large bitset is used by bloom filters, which can be quite large
and have been observed to cause stalls when allocating memory for the
storage.
Fixes: #6974
Tests: unit(dev)
"
* 'gentle-reserve/v1' of https://github.com/denesb/scylla:
utils/large_bitset: use reserve_partial() to reserve _storage
utils/chunked_vector: add reserve_partial()
There's a perf_bptree test that compares B+ tree collection with
std::set and std::map ones. There will come more, also the "patterns"
to compare are not just "fill with keys" and "drain to empty", so
here's the perf_collection test, that measures timings of
- fill with keys
- drain key by key
- empty with .clear() call
- full scan with iterator
- insert-and-remove of a single element
for currently used collections
- std::set
- std::map
- intrusive_set_external_comparator
- bplus::tree
* https://github.com/xemul/scylla/tree/br-perf-collection-test:
test: Generalize perf_bptree into perf_collection
perf_collection: Clear collection between iterations
perf_collection: Add intrusive_set_external_comparator
perf_collection: Add test for single element insertion
perf_collection: Add test for destruction with .clear()
perf_collection: Add test for full scan time
To avoid stalls when reserving memory for a large bloom filter. The
filter creation already has a yielding loop for initialization, this
patch extends it to reservation of memory too.
A variant of reserve() which allows gentle reserving of memory. This
variant will allocate just one chunk at a time. To drive it to
completion, one should call it repeatedly with the return value of the
previous call, until it returns 0.
This variant will be used in the next patch by the large bitset creation
code, to avoid stalls when allocating large bloom filters (which are
backed by large bitset).
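The reserve_partial() calling convention described above can be modeled in Python (chunk size and internals are illustrative; the real code is C++ in utils/chunked_vector):

```python
class ChunkedVector:
    """Model of a chunked vector with incremental reservation."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = []

    def reserve_partial(self, n):
        """Reserve capacity for n elements, allocating at most one chunk
        per call. Returns the value to pass to the next call; the
        reservation is complete when it returns 0."""
        needed_chunks = -(-n // self.chunk_size)  # ceil(n / chunk_size)
        if len(self.chunks) < needed_chunks:
            self.chunks.append([None] * self.chunk_size)
            if len(self.chunks) < needed_chunks:
                return n  # more chunks still needed; call again
        return 0

def reserve_gently(vec, n):
    # Drive reserve_partial() to completion; in Scylla this loop
    # contains a preemption point so large reservations don't stall.
    while n:
        n = vec.reserve_partial(n)
```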
Although the code for it existed already, the validation function
hasn't been invoked properly. This change fixes that, adding
a validating check when converting from text to specific value
type and throwing a marshal exception if some characters
are not ASCII.
Fixes #5421
Closes #7532
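A sketch of the validation (hypothetical helper; Scylla throws marshal_exception from its C++ ascii type, a ValueError stands in for it here):

```python
def validate_ascii(text):
    """Reject non-ASCII characters when converting text to an
    ascii-typed value."""
    if not text.isascii():
        raise ValueError(
            "marshaling error: non-ASCII character in ascii value")
    return text.encode("ascii")
```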
When unit tests fail, test.py dumps their output on the screen. It is impossible
to read this output from the terminal, all the more so since the logs are saved in
the testlog/ directory anyway. At the same time the names of the failed tests are all left
_before_ these logs, and if the terminal history is not large enough, it becomes
quite annoying to find them.
The proposal is not to spoil the terminal with raw logs -- just names and summaries.
Logs themselves are at testlog/$mode/$name_of_the_test.log
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201031154518.22257-1-xemul@scylladb.com>
Otherwise all readers will be created with the default forwarding::yes.
This inhibits some optimizations (e.g. results in more sstable read-ahead).
It will also be problematic when we introduce mutation sources which don't support
forwarding::yes in the future.
Message-Id: <1604065206-3034-1-git-send-email-tgrabiec@scylladb.com>
Clang brings us working support for coroutines, which are
needed for Raft and for code simplification.
perf_simple_query as well as full system tests show no
significant performance regression.
Test: unit(dev, release, debug)
Closes #7531
Fixes #7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
Closes #7498
* github.com:scylladb/scylla:
alternator::streams: Reduce the query limit depending on cdc opts
alternator::streams: Use end-of-record info in get_records
This series cleans up the gossiper endpoint_state interface
marking methods const and const noexcept where possible.
To achieve that, endpoint_state::get_status was changed to
return a string_view rather than a sstring so it won't
need to allocate memory.
Also, the get_cluster_name and get_partitioner_name were
changed to return a const sstring& rather than sstring
so they won't need to allocate memory.
The motivation for the series stems from #7339
where an exception in get_host_id within a storage_service
notification handler, called from seastar::defer crashed
the server.
With this series, get_host_id may still throw exceptions on
logical error, but not from calling get_application_state_ptr.
Refs #7339
Test: unit(dev)
* tag 'gossiper-endpoint-noexcept-v2':
gossiper: mark trivial methods noexcept
gossiper: get_cluster_name, get_partitioner_name: make noexcept
gossiper: get_gossip_status: return string_view and make noexcept
gms/endpoint_state: mark methods using get_status noexcept
gms/endpoint_state: get_status: return string_view and make noexcept
gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept
gms/endpoint_state: mark trivial methods noexcept
gms/heart_beat_state: mark methods noexcept
gms/versioned_value: mark trivial methods noexcept
gms/version_generator: mark get_next_version noexcept
fb_utilities.hh: mark methods noexcept
messaging: msg_addr: mark methods noexcept
gms/inet_address: mark methods noexcept
When logging in to a GCE instance created from the GCE image, it takes 10 seconds to determine that we are not running on AWS. Also, some unnecessary debug logging messages are printed:
```
bentsi@bentsi-G3-3590:~/devel/scylladb$ ssh -i ~/.ssh/scylla-qa-ec2 bentsi@35.196.8.86
Warning: Permanently added '35.196.8.86' (ECDSA) to the list of known hosts.
Last login: Sun Nov 1 22:14:57 2020 from 108.128.125.4
_____ _ _ _____ ____
/ ____| | | | | __ \| _ \
| (___ ___ _ _| | | __ _| | | | |_) |
\___ \ / __| | | | | |/ _` | | | | _ <
____) | (__| |_| | | | (_| | |__| | |_) |
|_____/ \___|\__, |_|_|\__,_|_____/|____/
__/ |
|___/
Version:
666.development-0.20201101.6be9f4938
Nodetool:
nodetool help
CQL Shell:
cqlsh
More documentation available at:
http://www.scylladb.com/doc/
By default, Scylla sends certain information about this node to a data collection server. For information, see http://www.scylladb.com/privacy/
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
[bentsi@artifacts-gce-image-jenkins-db-node-aa57409d-0-1 ~]$
```
This PR fixes this.
Closes #7523
* github.com:scylladb/scylla:
scylla_util.py: remove unnecessary logging
scylla_util.py: make is_aws_instance faster
scylla_util.py: added ability to control sleep time between retries in curl()
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.
Fixes #7515
Closes #7516
Fixes #7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
The instructions are updated for multiarch images (images that
can be used on x86 and ARM machines).
Additionally,
- docker is replaced with podman, since that is now used by
developers. Docker is still supported for developers, but
the image creation instructions are only tested with podman.
- added instructions about updating submodules
- `--format docker` is removed. It is not necessary with
more recent versions of docker.
Closes #7521
connection_notifier.hh defines a number of template-specialized
variables in a header. This is illegal since you're allowed to
define something multiple times if it's a template, but not if it's
fully specialized. gcc doesn't care but clang notices and complains.
Fix by defining the variables as inline variables, which are
allowed to have definitions in multiple translation units.
Closes #7519
Some ARM cores are slow, and trip our current timeout of 3000
seconds in debug mode. Quadrupling the timeout is enough to make
debug-mode tests pass on those machines.
Since the timeout's role is to catch rare infinite loops in unsupervised
testing, increasing the timeout has no ill effect (other than to
delay the report of the failure).
Closes #7518
The main goal of this series is to fix issue #6951 - a Query (or Scan) with
a combination of filtering and projection parameters produced wrong results if
the filter needs some attributes which weren't projected.
This series also adds new tests for various corner cases of this issue. These
new tests also pass after this fix, or still fail because some other missing
feature (namely, nested attributes). These additional tests will be important if
we ever want to refactor or optimize this code, because they exercise some rare
corner code paths at the intersection of filtering and projection.
This series also fixes some additional problems related to this issue, like
combining old and new filtering/projection syntaxes (should be forbidden), and
even one fix to a wrong comment.
Closes #7328
* github.com:scylladb/scylla:
alternator test: tests for nested attributes in FilterExpression
alternator test: fix comment
alternator tests: additional tests for filter+projection combination
alternator: forbid combining old and new-style parameters
alternator: fix query with both projection and filtering
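The fix for the filter+projection combination boils down to evaluating the filter against the full item before trimming it to the projection. A sketch under that assumption (hypothetical helper, not Alternator's actual C++ code):

```python
def query_with_filter_and_projection(items, filter_fn, projected_attrs):
    """Filter on the full item first, then project, so the filter may
    reference attributes that are not in the projection."""
    results = []
    for item in items:
        if filter_fn(item):  # the filter sees all attributes
            results.append({k: v for k, v in item.items()
                            if k in projected_attrs})
    return results
```

Projecting before filtering (the buggy order) would drop the very attributes the filter needs, producing wrong results.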
When curl is called and an exception is raised, we see unnecessary log messages that we can't control.
For example, when used in scylla_login we can see the following messages:
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
These methods can return a const sstring& rather than
allocating a sstring. And with that they can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Change get_gossip_status to return string_view,
and with that it can be noexcept now that it doesn't
allocate memory via sstring.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_status returns string_view, just compare it with a const char*
rather than making a sstring out of it, and consequently, it can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get_status doesn't need to allocate a sstring, it can just
return a std::string_view to the status string, if found.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
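The pattern described above can be sketched as follows; the class and member names here are illustrative stand-ins, not the actual gossiper types:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>

// Sketch of the pattern: return a std::string_view into stored state instead
// of allocating a new string, which lets the accessor be marked noexcept and
// lets callers compare against a const char* without constructing a string.
class endpoint_state {
    std::map<int, std::string> _application_state;
public:
    void set(int key, std::string value) {
        _application_state[key] = std::move(value);
    }
    // Returns a view into the stored string; empty view if not found.
    std::string_view get_status(int key) const noexcept {
        auto it = _application_state.find(key);
        return it == _application_state.end() ? std::string_view{}
                                              : std::string_view{it->second};
    }
};
```

The returned view is only valid as long as the stored string is not modified or destroyed, which is the trade-off this style accepts in exchange for avoiding allocation.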
Although std::map::find is not guaranteed to be noexcept,
it depends on the comparator used, and in this case comparing application_state
is noexcept. Therefore, we can safely mark get_application_state_ptr noexcept.
is_cql_ready depends on get_application_state_ptr and otherwise
handles exceptions from boost::lexical_cast, so it can be marked
noexcept as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_next_version() is noexcept,
update_heart_beat can be noexcept too.
All others are trivially noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on gms::inet_address.
With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clang does not implement P1814R0 (class template argument deduction
for alias templates), so it can't deduce the template arguments
for range_bound, but it can for interval_bound, so switch to that.
Using the modern name rather than the compatibility alias is preferred
anyway.
Closes #7422
In commit de38091827 the two IO priority classes streaming_read
and streaming_write were merged into just one. The document docs/isolation.md
leaves a lot to be desired (hint, hint, to anyone reading this who
can write content!) but let's at least not have incorrect information
there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201101102220.2943159-1-nyh@scylladb.com>
This series adds more context to debugging information in case a view gets out of sync with its base table.
A test was conducted manually, by:
1. creating a table with a secondary index
2. manually deleting computed column information from system_schema.computed_columns
3. restarting the target node
4. trying to write to the index
Here's what's logged right after the index metadata is loaded from disk:
```
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Column idx_token in view ks.t_c_idx_index was not found in the base table ks.t
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Missing idx_token column is caused by an incorrect upgrade of a secondary index. Please recreate index ks.t_c_idx_index to avoid future issues.
```
And here's what's logged during the actual failure - when Scylla notices that there exists
a column which is not computed, but it's also not found in the base table:
```
ERROR 2020-10-30 12:31:25,709 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: seastar::internal::backtraced<std::runtime_error> (base_schema(): operation unsupported when initialized only for view reads. Missing column in the base table: idx_token Backtrace: 0x1d14513
0x1d1468b
0x1d1492b
0x109bbad
0x109bc97
0x109bcf4
0x1bc4370
0x1381cd3
0x1389c38
0xaf89bf
0xaf9b20
0xaf1654
0xaf1afe
0xb10525
0xb10ad8
0xb10c3a
0xaaefac
0xabf525
0xabf262
0xac107f
0x1ba8ede
0x1bdf749
0x1be338c
0x1bfe984
0x1ba73fa
0x1ba77a4
0x9ea2c8
/lib64/libc.so.6+0x27041
0x9d11cd
--------
seastar::lambda_task<seastar::execution_stage::flush()::{lambda()#1}>
```
Hopefully, this information will make it much easier to solve future problems with out-of-sync views.
Tests: unit(dev)
Fixes #7512
Closes #7513
* github.com:scylladb/scylla:
view: add printing missing base column on errors
view: simplify creating base-dependent info for reads only
view: fix typo: s/dependant/dependent
view: add error logs if a view is out of sync with its base
In certain CQL statements it's possible to provide a custom timestamp via the USING TIMESTAMP clause. Those values are accepted in microseconds, however, there's no limit on the timestamp (apart from type size constraint) and providing a timestamp in a different unit like nanoseconds can lead to creating an entry with a timestamp way ahead in the future, thus compromising the table.
To avoid this, this change introduces a sanity check for modification and batch statements that raises an error when a timestamp of more than 3 days into the future is provided.
Fixes #5619
Closes #7475
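The sanity check described above can be sketched as follows; the function name and error type are illustrative assumptions, not the actual Scylla implementation:

```cpp
#include <chrono>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch of the check: USING TIMESTAMP values are expected in
// microseconds since the epoch, so a value more than 3 days ahead of the
// current time is rejected (e.g. a nanosecond timestamp passed by mistake
// would land far in the future and be caught here).
void check_timestamp_is_valid(int64_t ts_micros) {
    using namespace std::chrono;
    const int64_t now_micros = duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
    const int64_t max_delta = duration_cast<microseconds>(hours(72)).count();
    if (ts_micros - now_micros > max_delta) {
        throw std::runtime_error("custom timestamp too far in the future");
    }
}
```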
The constructors just set up the references, real start happens in .start()
so it is safe to do this early. This helps not carrying migration manager
and query processor down the storage service cluster joining code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.
To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).
The qctx itself is a global pointer, which waits for its fix too, of course.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an out-of-sync view is attempted to be used in a write operation,
the whole operation needs to be aborted with an error. After this patch,
the error contains more context - namely, the missing column.
The code which created base-dependent info for materialized views
can be expressed with fewer branches. Also, the constructor
which takes a single parameter is made explicit.
When Scylla finds out that a materialized view contains columns
which are not present in the base table (and they are not computed),
it now presents comprehensible errors in the log.
* seastar 6973080cd1...57b758c2f9 (11):
> http: handle 'match all' rule correctly
> http: add missing HTTP methods
> memory: remove unused lambda capture in on_allocation_failure()
> Support seastar allocator when seastar::alien is used
> Merge "make timer related functions noexcept" from Benny
> script: update dependecy packages for centos7/8
> tutorial: add linebreak between sections
> doc: add nav for the second last chap
> doc: add nav bar at the bottom also
> doc: rename add_prologue() to add_nav_to_body()
> Wrong name used in an example in mini tutorial.
An upcoming change in Seastar only initializes the Seastar allocator in
reactor threads. This causes imr_test and double_decker_test to fail:
1. Those tests rely on LSA working
2. LSA requires the Seastar allocator
3. Seastar is not initialized, so the Seastar allocator is not initialized.
Fix by switching to the Seastar test framework, which initializes Seastar.
Closes #7486
test.py estimates the amount of memory needed per test
in order not to overload the machine, but it underestimates
badly and so machines with many cores but not a lot of memory
fail the tests (in debug mode principally) due to running out
of memory.
Increase the estimate from 2GB per test to 6GB.
Closes #7499
gcc collects all the initialization code for thread-local storage
and puts it in one giant function. In combination with debug mode,
this creates a very large stack frame that overflows the stack
on aarch64.
Work around the problem by placing each initializer expression in
its own function, thus reusing the stack.
Closes #7509
Add additional comments to select_statement_utils, fix formatting, add
missing #pragma once and introduce set_internal_paging_size_guard to
set internal_paging in RAII fashion.
Closes #7507
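The RAII guard mentioned above can be sketched like this; the variable and class names follow the commit, but the bodies are an illustrative assumption:

```cpp
// Sketch (names from the commit, implementation assumed) of an RAII guard
// that overrides the internal paging size for the duration of a scope and
// restores the previous value on exit, even if an exception unwinds the scope.
static int internal_paging_size = 10000; // stand-in for the real variable

class set_internal_paging_size_guard {
    int _old;
public:
    explicit set_internal_paging_size_guard(int new_size)
        : _old(internal_paging_size) {
        internal_paging_size = new_size;
    }
    ~set_internal_paging_size_guard() {
        internal_paging_size = _old; // restore on scope exit
    }
    // A guard should not be copied: that would restore the value twice.
    set_internal_paging_size_guard(const set_internal_paging_size_guard&) = delete;
    set_internal_paging_size_guard& operator=(const set_internal_paging_size_guard&) = delete;
};
```

A test would create the guard in a block scope, run its queries with the small page size, and rely on the destructor to undo the override.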
Gossip currently runs inside the default (main) scheduling group, which is
generally fine. However, from time to time we see many tasks in the main
scheduling group and we suspect gossip. It is best to move gossip to a
dedicated scheduling group, so that we can catch bugs that leak tasks to
the main group more easily.
After this patch, we can check:
scylla_scheduler_time_spent_on_task_quota_violations_ms{group="gossip",shard="0"}
Fixes: #7154
Tests: unit(dev)
We require a kernel that is at least 3.10.0-514, because older
kernels have an XFS-related bug that causes data corruption. However
this Requires: clause pulls in a kernel even in a Docker installation,
where it (and especially the associated firmware) occupies a lot of
space.
Change to a Conflicts: instead. This prevents installation when
the really old kernel is present, but doesn't pull it in for the
Docker image.
Closes #7502
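In RPM spec terms, the change amounts to something like the following fragment (a sketch; the exact tag lines in the real spec file may differ):

```
# Before: pulled a kernel (and firmware) into Docker images
#   Requires:  kernel >= 3.10.0-514
# After: only blocks installation alongside the broken kernel
Conflicts: kernel < 3.10.0-514
```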
Overview
Fixes #7355.
Before this change, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).
Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves forward fixing of that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.
It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).
GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.
Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).
The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
```
The second commit fixes this issue by properly trimming `row_ranges`.
The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.
The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).
The second commit fixes the first two failing examples from issue #7355.
Tests (fourth commit)
Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s.
Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).
I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact on those fixed bugs, therefore the tests validate a large number of those scenarios.
Configurable internal_paging_size (third commit)
Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds new internal_paging_size variable, which is configurable by `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.
Closes #7497
* github.com:scylladb/scylla:
tests: Add secondary index aggregates tests
select_statement: Introduce internal_paging_size
select_statement: Fix paging on indexed selects
select_statement: Fix GROUP BY on indexed select
Update the toolchain to Fedora 33 with clang 11 (note the
build still uses gcc).
The image now creates a /root/.m2/repository directory; without
this the tools/jmx build fails on aarch64.
Add java-1.8.0-openjdk-devel since that is where javac lives now.
Add a JAVA8_HOME environment variable; without this ant is not
able to find javac.
The toolchain is enabled for x86_64 and aarch64.
Extensively tests queries on tables with secondary indices with
aggregates and GROUP BYs. Tests three cases that are implemented
in indexed_table_select_statement::do_execute - partition_slices,
whole_partitions and (non-partition_slices and non-whole_partitions).
As some of the issues found were related to paging, the tests check
scenarios where the inserted data is smaller than a page, larger than
a page and larger than two pages (and some boundary scenarios).
Before this change, internal page_size when doing aggregate, GROUP BY
or nonpaged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug build), because the tests had
to be large to test cases when there are multiple internal pages.
This change adds new internal_paging_size variable, which is
configurable by set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is only meant for testing purposes.
Fixes two issues related to improper paging on indexed SELECTs. As those
two issues are closely related (fixing one without fixing the other
causes invalid results of queries), they are in a single commit.
The first issue is that when using slice.set_range, the existing
_row_ranges (which specify clustering key prefixes) are not taken into
account. This caused the wrong rows to be included in the result, as the
clustering key bound was set to a half-open range:
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
This change fixes this issue by properly trimming row_ranges.
The second fixed problem is related to setting the paging_state
to internal_options. It was improperly set just after reading from
index, making the base query start from invalid paging_state.
This change fixes this issue by setting the paging_state after both
index and base table queries are done. Moreover, the paging_state is
now set based on paging_state of index query and the results of base
table query (as base query can return more rows than index query).
Fixes the first two failing examples from issue #7355.
Before the change, GROUP BY SELECTs with some WHERE restrictions on an
indexed column would return invalid results (same grouped column values
appearing multiple times):
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
This is fixed by correctly passing _group_by_cell_indices to
result_set_builder. Fixes the third failing example from issue #7355.
This is the continuation of 30722b8c8e, so let me re-cite Rafael:
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201028140553.21709-1-xemul@scylladb.com>
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:
class repair_writer {
_writer_done[node_idx] =
mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
...
}).handle_exception([this] {
...
});
}
The fiber accesses the repair_writer object in the error handling path. We
wait for _writer_done to finish before we destroy the repair_meta
object, which contains the repair_writer object, to avoid the fiber
accessing an already freed repair_writer object.
To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.
Fixes #7406
Closes #7430
LD_PRELOAD libraries usually have dependencies in the host system,
which they will not have access to in a relocatable environment
since we use a different libc. Detect that LD_PRELOAD is in use and if
so, abort with an error.
Fixes #7493.
Closes #7494
Clang complains if it sees linker-only flags when called for compilation,
so move the compile-time flags from cxx_ld_flags to cxxflags, and remove
cxx_ld_flags from the compiler command line.
The linker flags are also passed to Seastar so that the build-id and
interpreter hacks still apply to iotune.
Closes #7466
python3 has its own relocatable package, no need to include it
in scylla-package.tar.gz.
Closes #7467
We already have a test for the behavior of a closed shard and how
iterators previously created for it are still valid. In this patch
we add to this also checking that the shard id itself, not just the
iterator, is still valid.
Additionally, although the aforementioned test used a disabled stream
to create a closed shard, it was not a complete test for the behavior
of a disabled stream, and this patch adds such a test. We check that
although the stream is disabled, it is still fully usable (for 24 hours) -
its original ARN is still listed on ListStreams, the ARN is still usable,
its shards can be listed, all are marked as closed but still fully readable.
Both tests pass on DynamoDB, and xfail on Alternator because of
issue #7239 - CDC drops the CDC log table as soon as CDC is disabled,
so the stream data is lost immediately instead of being retained for
24 hours.
Refs #7239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006183915.434055-1-nyh@scylladb.com>
This patch fills the following columns in `system.clients` table:
* `connection_stage`
* `driver_name`
* `driver_version`
* `protocol_version`
It also improves:
* `client_type` - distinguishes cql from thrift just in case
* `username` - now it displays correct username iff `PasswordAuthenticator` is configured.
What is still missing:
* SSL params (I'll happily get some advice here)
* `hostname` - I didn't find it in tested drivers
Refs #6946
Closes #7349
* github.com:scylladb/scylla:
transport: Update `connection_stage` in `system.clients`
transport: Retrieve driver's name and version from STARTUP message
transport: Notify `system.clients` about "protocol_version"
transport: On successful authentication add `username` to system.clients
This patch changes the code that iterates over the metrics to use a copy
of the metric names, making it safe to remove metrics from the
metrics object.
Fixes #7488
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
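The copy-then-erase pattern can be sketched as follows; the map here is an illustrative stand-in for the real metrics registry:

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch of the pattern: snapshot the metric names first, then iterate over
// the snapshot, so erasing entries from the live container during the loop
// does not invalidate the iterator we are walking with.
std::map<std::string, int> metrics = {{"reads", 1}, {"writes", 2}, {"hits", 3}};

void remove_all_metrics() {
    std::vector<std::string> names;
    names.reserve(metrics.size());
    for (const auto& [name, value] : metrics) {
        names.push_back(name);
    }
    for (const auto& name : names) {
        metrics.erase(name); // safe: we iterate over the copy, not `metrics`
    }
}
```

Erasing through the container's own iterators while ranging over it is undefined behavior; iterating over a copied key list sidesteps that entirely at the cost of one allocation.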
Allows QA to bypass the normal hardcoded 24h TTL of data and still
get "proper" behaviour w.r.t. available stream sets/generations.
I.e. one can manually change the cdc ttl option for an alternator table
after streams are enabled. Should not be exposed, but perhaps useful for
testing.
Closes #7483
Refs #7364
The number of tombstones can be large. As a stopgap measure, short of
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.
Closes #7433
Fixes #7435
Adds an "eor" (end-of-record) column to the cdc log. This is non-null only on
rows that are last in their timestamp group, i.e. the end of a singular source "event".
A client can use this as a shortcut to knowing whether or not it has a
full cdc "record" for a given source mutation (single row change).
Closes #7436
In bd6855e, we reverted to Boost ranges and commented out the concept
check. But Boost has its own concept check, which this patch enables.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #7471
Currently, we linearize large UTF8 cells in order to validate them.
This can cause large latency spikes if the cell is large.
This series changes UTF8 validation to work on fragmented buffers.
This is somewhat tricky since the validation routines are optimized
for single-instruction-multiple-data (SIMD) architectures.
The unit tests are expanded to cover the new functionality.
Fixes #7448.
Closes #7449
* github.com:scylladb/scylla:
types: don't linearize utf8 for validation
test: utf8: add fragmented buffer validation tests
utils: utf8: add function to validate fragmented buffers
utils: utf8: expose validate_partial() in a header
utils: utf8: introduce validate_partial()
utils: utf8: extract a function to evaluate a single codepoint
This PR's purpose is to handle schema integrity issues that can arise in races involving materialized views.
The possibility of such integrity issues was found in #7420, where a view schema was used for reading without
its _base_info member initialized, resulting in a segfault.
We handle this by doing 3 things:
1. First, we guard against using an uninitialized base info - this will be considered an internal error, as it indicates that
there is a path in our code that creates a view schema to be used for reads or writes without initializing the base info.
2. We allow the base info to be initialized also from a partially matching base (most likely a newer one than the one used to create the view).
3. We fix the suspected path (in the migration manager) that creates such a view schema so that it initializes the base info.
It is worth mentioning that this PR is a workaround for a probable design flaw in our materialized views, which requires the base
table's information to be retrieved in the first place instead of the view being self-contained.
Refs #7420
Closes #7469
* github.com:scylladb/scylla:
materialized views: add a base table reference if missing
view info: support partial match between base and view for only reading from view.
view info: guard against null dereference of the base info
Schema pointers can be obtained from two distinct entities:
one is the database, where schemas are obtained from the table
objects, and the other is the schema registry.
When a schema or a new schema is attached to a table object that
represents a base table for views, all of the corresponding attached
view schemas are guaranteed to have their base info in sync.
However, if an older schema is inserted into the registry by the
migration manager, i.e. loaded from another node, it will be
missing this info.
This becomes a problem when this schema is published through the
schema registry, as it can be obtained for an obsolete read command,
for example, and then eventually cause a segmentation fault by
dereferencing the null _base_info ptr.
Refs #7420
The current implementation of materialized views does
not keep the base-table version to which a specific version of a materialized
view schema corresponds. This complicates things, especially for
old view versions that the current base schema no longer matches. However,
a view, being also an independent table, should allow reading from
it as long as it exists, even if the base table changed since then.
For reading purposes, we don't need to know the exact composition
of view primary key columns that are not part of the base primary
key; we only need to know whether there are any, and this is a much
looser constraint on the schema.
We can rely on table invariants, such as the fact that pk columns are
not going to disappear in newer versions of the table.
This means that if we don't find a view column in the base table, it is
not part of the base table's primary key.
This information is enough for us to perform reads on the view.
This commit adds support for being able to rely on such partial
information along with a validation that it is not going to be used for
writes. If it is, we simply abort since this means that our schema
integrity is compromised.
The change's purpose is to guard against a segfault that is the
result of dereferencing the _base_info member when it is
uninitialized. We already know this can happen (#7420).
The only purpose of this change is to treat this condition as
an internal error, the reason is that it indicates a schema integrity
problem.
Besides this change, other measures should be taken to ensure that
the _base_table member is initialized before calling methods that
rely on it.
We call internal_error as a last resort.
Use the new non-linearizing validator, avoiding linearization.
Linearization can cause large contiguous memory allocations, which
in turn causes latency spikes.
Fixes #7448.
Since there are a huge number of variations, we use random
testing. Each test case is composed of a random number of valid
code points, with a possible invalid code point somewhere. The
test case is broken up into a random number of fragments. We
test both validation success and error position indicator.
Add a function to validate fragmented buffers. We validate
each buffer with SIMD-optimized validate_partial(), then
collect the codepoint that spans buffer boundaries (if any)
in a temporary buffer, validate that too, and continue.
The current validators expect the buffer to contain a full
UTF-8 string. This won't be the case for fragmented buffers,
since a codepoint can straddle two (or more) buffers.
To prepare for that, convert the existing validators to
validate_partial(), which returns either an error, or
success with an indication of the size of the tail that
was not validated and how many bytes it is missing.
This is natural since the SIMD validators already
cannot process a tail in SIMD mode if it's smaller than
the vector size, so only minor rearrangements are needed.
In addition, we now have validate_partial() for non-SIMD
architectures, since we'll need it for fragmented buffer
validation.
Our SIMD optimized validators cannot process a codepoint that
spans multiple buffers, and adapting them to be able to will slow
them down. So our strategy is to special-case any codepoint that
spans two buffers.
To do that, extract an evaluate_codepoint() function from the
current validate_naive() function. It returns three values:
- if a codepoint was successfully decoded from the buffer,
how many bytes were consumed
- if not enough bytes were in the buffer, how many more
are needed
- otherwise, an error happened, so return an indication
The new function uses a table to calculate a codepoint's
size from its first byte, similar to the SIMD variants.
validate_naive() is now implemented in terms of
evaluate_codepoint().
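The three-valued result described above can be sketched as follows. This is a simplified illustration (it does not reject overlong encodings or lead bytes above 0xF4), and the names are assumptions, not the actual Scylla code:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of evaluate_codepoint(): from the first byte, look up the expected
// sequence length, then report either how many bytes were consumed, how many
// more are needed (codepoint straddles a buffer boundary), or an error.
struct codepoint_result {
    enum class kind { consumed, need_more, error } k;
    size_t n; // bytes consumed, or bytes still missing
};

codepoint_result evaluate_codepoint(const uint8_t* p, size_t len) {
    // Expected sequence length keyed by the top 4 bits of the first byte
    // (0 = invalid lead byte). Simplified: 0xF5..0xFF are not rejected here.
    static constexpr uint8_t lengths[16] = {
        1, 1, 1, 1, 1, 1, 1, 1, // 0xxxxxxx: ASCII
        0, 0, 0, 0,             // 10xxxxxx: continuation byte cannot lead
        2, 2,                   // 110xxxxx: 2-byte sequence
        3,                      // 1110xxxx: 3-byte sequence
        4                       // 11110xxx: 4-byte sequence
    };
    if (len == 0) {
        return {codepoint_result::kind::need_more, 1};
    }
    const size_t want = lengths[p[0] >> 4];
    if (want == 0) {
        return {codepoint_result::kind::error, 0};
    }
    if (len < want) {
        return {codepoint_result::kind::need_more, want - len};
    }
    for (size_t i = 1; i < want; ++i) {
        if ((p[i] & 0xc0) != 0x80) { // trailing bytes must be 10xxxxxx
            return {codepoint_result::kind::error, 0};
        }
    }
    return {codepoint_result::kind::consumed, want};
}
```

The "need_more" case is exactly what the fragmented-buffer validator uses: it copies the dangling tail plus the missing bytes from the next fragment into a small temporary buffer and validates that separately.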
This is a regression caused by aebd965f0.
After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing reference to them, which prevents their
resources like disk space and fd from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.
Fixes #7463.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
... type deduction and parenthesized aggregate construction' from Avi Kivity
Clang does not implement P0960R3 and P1816R0, so constructions
of aggregates (structs with no constructors) have to use braced initialization
and cannot use class template argument deduction. This series makes the
adjustments.
Closes #7456
* github.com:scylladb/scylla:
reader_concurrency_semaphore: adjust permit_summary construction for clang
schema_tables: adjust altered_schema construction for clang
types: adjust validation_visitor construction for clang
Makes files shorter while still keeping the lines under 120 columns.
Separate from other commits to make review easier.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Don't require filtering when a continuous slice of the clustering key
is requested, even if partition is unrestricted. The read command we
generate will fetch just the selected data; filtering is unnecessary.
Some tests needed to update the expected results now that we're not
fetching the extra data needed for filtering. (Because tests don't do
the final trim to match selectors and assert instead on all the data
read.)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
permit_summary.
As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
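The workaround can be illustrated with a minimal sketch; `permit_summary` here is a stand-in aggregate, not the real type:

```cpp
#include <vector>

// Illustration of the workaround: with an aggregate (no constructors),
// v.emplace_back(1, 2) relies on P0960R3 (parenthesized aggregate
// initialization), which clang lacked. So we build the braced temporary
// ourselves and push_back it.
struct permit_summary { // illustrative aggregate
    int permits;
    int memory;
};

std::vector<permit_summary> summaries;

void record(int permits, int memory) {
    summaries.push_back(permit_summary{permits, memory}); // braced call by hand
}
```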
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
altered_schema.
As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
validation_visitor. It also does not implement class template
argument deduction for aggregates (P1816R0), so we have to
specify the template parameters explicitly.
Clang has trouble compiling libstdc++'s `<ranges>`. It is not known whether
the problem is in clang or in libstdc++; I filed bugs for both [1] [2].
Meanwhile, we wish to use clang to gain working coroutine support,
so drop the failing uses of `<ranges>`. Luckily the changes are simple.
[1] https://bugs.llvm.org/show_bug.cgi?id=47509
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97120
Closes #7450
* github.com:scylladb/scylla:
test: view_build_test: drop <ranges>
test: mutation_reader_test: drop <ranges>
cql3: expression: drop <ranges>
sstables: leveled_compaction_strategy: drop use of <ranges>
utils: to_range(): relax constraint
The check was added to support migration from schema tables format v2
to v3. It was needed to handle the rolling upgrade from 2.x to 3.x
scylla version. Old nodes wouldn't recognize new schema mutations, so
the pull handler in v3 was changed to ignore requests from v2 nodes based
on their advertised SCHEMA_TABLES_VERSION gossip state.
This started to cause problems after 3b1ff90 (get rid of the seed
concept). The bootstrapping node sometimes would hang during boot
unable to reach schema agreement.
It's relevant that gossip exchanges about new nodes are unidirectional
(refs #2862).
It's also relevant that pulls are edge-triggered only (refs #7426).
If the bootstrapping node (A) is listed as a seed in one of the
existing node's (B) configuration then node A can be contacted before
it contacts node B. Node A may then send schema pull request to node B
before it learns about node A, and node B will assume it's an old node
and give an empty response. As a result, node A will end up with an
old schema.
The fix is to drop the check so that pull handler always responds with
the schema. We don't support upgrades from nodes using v2 schema
tables format anymore so this should be safe.
Fixes #7396
Tests:
- manual (ccm)
- unit (dev)
Message-Id: <1602612578-21258-1-git-send-email-tgrabiec@scylladb.com>
The input range to utils::to_range() should be indeed a range,
but clang has trouble compiling <ranges> which causes it to fail.
Relax the constraint until this is fixed.
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
Hopefully, most of these lambda captures will be replaced with
coroutines.
Closes #7445
* github.com:scylladb/scylla:
test: mutation_reader_test: don't capture structured bindings in lambdas
api: column_family: don't capture structured bindings in lambdas
thrift: don't capture structured bindings in lambdas
test: partition_data_test: don't capture structured bindings in lambdas
test: querier_cache_test: don't capture structured bindings in lambdas
test: mutation_test: don't capture structured bindings in lambdas
storage_proxy: don't capture structured bindings in lambdas
db: hints/manager: don't capture structured bindings in lambdas
db: commitlog_replayer: don't capture structured bindings in lambdas
cql3: select_statement: don't capture structured bindings in lambdas
cql3: statement_restrictions: don't capture structured bindings in lambdas
cdc: log: don't capture structured bindings in lambdas
"
Focusing on different aspects of debugging Scylla. Also expand some of
the existing segments and fix some small issues around the document.
"
* 'debugging.md-advanced-guides/v1' of https://github.com/denesb/scylla:
docs/debugging.md: add thematic debugging guides
docs/debugging.md: tips and tricks: add section about optimized-out variables
docs/debugging.md: TLS variables: add missing $ to terminal command
docs/debugging.md: TUI: describe how to switch between windows
docs/debugging.md: troubleshooting: expand on crash on backtrace
Before this change, invalid query exception on selects with both normal
and token restrictions was only thrown when token restriction was after
normal restriction.
This change adds proper validation when token restriction is before normal restriction.
**Before the change - does not return error in last query; returns wrong results:**
```
cqlsh> CREATE TABLE ks.t(pk int, PRIMARY KEY(pk));
cqlsh> INSERT INTO ks.t(pk) VALUES (1);
cqlsh> INSERT INTO ks.t(pk) VALUES (2);
cqlsh> INSERT INTO ks.t(pk) VALUES (3);
cqlsh> INSERT INTO ks.t(pk) VALUES (4);
cqlsh> SELECT pk, token(pk) FROM ks.t WHERE pk = 2 AND token(pk) > 0;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Columns "ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}" cannot be restricted by both a normal relation and a token relation"
cqlsh> SELECT pk, token(pk) FROM ks.t WHERE token(pk) > 0 AND pk = 2;
pk | system.token(pk)
----+---------------------
3 | 9010454139840013625
(1 rows)
```
Closes #7441
* github.com:scylladb/scylla:
tests: Add token and non-token conjunction tests
token_restriction: Add non-token merge exception
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).
It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.
The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).
In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.
Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.
Such a situation may arise under the following circumstances:
1. The previous LWT operation crashed on the "accept" stage,
leaving behind a stale accepted proposal, which waits to be
repaired.
2. The table affected by LWT operation is being altered, so that
schema version is now different. Stored proposal now references
obsolete schema.
3. LWT query is retried, so that Scylla tries to repair the
unfinished Paxos round and apply the mutation in the learn stage.
When such a mismatch happens, prior to this patch the stored
`frozen_mutation` could be applied only if we were lucky enough
and the column_mapping in the mutation was "compatible" with the new
table schema.
It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.
With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.
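The upgrade step can be illustrated with a small sketch (names and structures here are hypothetical, not Scylla's actual API): the persisted column mapping translates each value's old position to a column name, which is then resolved against the new schema; values for dropped columns are discarded.

```python
def upgrade_row(old_row, old_mapping, new_schema):
    """old_row: cell values indexed by old column position.
    old_mapping: column name for each old position (the persisted mapping).
    new_schema: column name -> new position; dropped columns are absent."""
    new_row = [None] * len(new_schema)
    for old_pos, value in enumerate(old_row):
        new_pos = new_schema.get(old_mapping[old_pos])
        if new_pos is not None:          # cells of dropped columns are discarded
            new_row[new_pos] = value
    return new_row

# Columns "a" and "c" were reordered and "b" was dropped by ALTER TABLE.
print(upgrade_row([1, 2, 3], ["a", "b", "c"], {"c": 0, "a": 1}))  # [3, 1]
```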
* git@github.com:ManManson/scylla.git feature/table_schema_history_v7:
lwt: add column_mapping history persistence tests
schema: add equality operator for `column_mapping` class
lwt: store column_mapping's for each table schema version upon a DDL change
schema_tables: extract `fill_column_info` helper
frozen_mutation: introduce `unfreeze_upgrading` method
There are two basic tests, which:
* Test that column mappings are serialized and deserialized
properly on both CREATE TABLE and ALTER TABLE
* Column mappings for obsoleted schema versions are updated
with a TTL value on schema change
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Add a comparator for column mappings that will be used later
in unit-tests to check whether two column mappings match or not.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
In case we don't find a column_mapping we just return an error
from the learn stage.
Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Now that scylla-ccm and scylla-dtest conform to PEP-440
version comparison (See https://www.python.org/dev/peps/pep-0440/)
we can safely change scylla version on master to be the development
branch for the next release.
The version order logic is:
4.3.dev is followed by
4.3.rc[i] followed by
4.3.[n]
Note that also according to
https://blog.jasonantman.com/2014/07/how-yum-and-rpm-compare-versions/
4.3.dev < 4.3.rc[i] < 4.3.[n]
as "dev" < "rc" by alphabetical order
and both "dev" and "rc*" < any number, based on the general
rule that alphabetical strings compare as less than numbers.
Test: unit
Dtest: gating
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201015151153.726637-1-bhalevy@scylladb.com>
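The ordering above can be checked with a small sketch (the suffix_key helper is illustrative and only handles the dev/rc/number suffixes mentioned here; real comparisons should use a full PEP-440 implementation such as packaging.version):

```python
import re

def suffix_key(suffix):
    # "dev" sorts before any "rc<i>", which sorts before plain numbers.
    if suffix == "dev":
        return (0, 0)
    m = re.fullmatch(r"rc(\d+)", suffix)
    if m:
        return (1, int(m.group(1)))
    return (2, int(suffix))              # plain release number

versions = ["4.3.1", "4.3.dev", "4.3.rc2", "4.3.0", "4.3.rc1"]
ordered = sorted(versions, key=lambda v: suffix_key(v.rsplit(".", 1)[1]))
print(ordered)  # ['4.3.dev', '4.3.rc1', '4.3.rc2', '4.3.0', '4.3.1']
```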
Commit bc65659a46 adjusted the inlining parameters for gcc. Here
we do the same for clang. With this adjustment, clang lags gcc
by 3% in throughput (perf_simple_query --smp 1) compared to 20%
without it.
The value 2500 was derived by binary search. At 5000 compilation
of storage_proxy never completes, at 1250 throughput is down by
10%.
Closes #7418
Support snapshotting for raft. The patch series only concerns itself
with raft logic, not how a specific state machine implements
take_snapshot() callback.
* scylla-dev/raft-snapshots-v2:
raft: test: add tests for snapshot functionality
raft: preserve trailing raft log entries during snapshotting
raft: implement periodic snapshotting of a state machine
raft: add snapshot transfer logic
Checks for invalid_request_exception in case of trying to run a query
with both normal and token relations. Tests both orderings of those
relations (normal or token relation first).
Add exception that is thrown when merging of token and non-token
restrictions is attempted. Before this change only merging non-token
and token restriction was validated (WHERE pk = 0 AND token(pk) > 0)
and not the other way (WHERE token(pk) > 0 AND pk = 0).
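A sketch of why the check must be symmetric (class and function names are illustrative, not Scylla's): restrictions are merged one by one in WHERE-clause order, so the mixing check has to fire no matter which side was merged first.

```python
class InvalidRequest(Exception):
    pass

def merge(existing_kinds, new_kind):
    # Reject mixing token and non-token restrictions, in either order.
    if existing_kinds and ("token" in existing_kinds) != (new_kind == "token"):
        raise InvalidRequest("cannot mix token and non-token restrictions")
    return existing_kinds | {new_kind}

def validate(where_clause_kinds):
    kinds = set()
    for kind in where_clause_kinds:
        kinds = merge(kinds, kind)

for order in (["non_token", "token"], ["token", "non_token"]):
    try:
        validate(order)
    except InvalidRequest:
        print("rejected:", order)        # both orders are now rejected
```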
This patch allows leaving a snapshot_trailing number of entries behind
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
The patch implements periodic taking of a snapshot and trimming of
the raft log.
In raft, the only way the log of already committed entries can be
shortened is by taking a snapshot of the state machine and dropping
the log entries included in the snapshot from the raft log. To keep
the log from growing too large, the patch takes a snapshot periodically
after applying N entries, where N can be configured by setting the
snapshot_threshold value in raft's configuration.
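A minimal sketch of this policy, combined with the trailing-entries behavior from the previous patch (all names and the exact truncation point are illustrative):

```python
class RaftLog:
    # Illustrative model of the applied-entries log and snapshot policy.
    def __init__(self, snapshot_threshold, snapshot_trailing):
        self.entries = []                # committed, applied entries
        self.applied_since_snapshot = 0
        self.snapshot_threshold = snapshot_threshold
        self.snapshot_trailing = snapshot_trailing
        self.snapshots_taken = 0

    def apply(self, entry):
        self.entries.append(entry)
        self.applied_since_snapshot += 1
        if self.applied_since_snapshot >= self.snapshot_threshold:
            self.take_snapshot()

    def take_snapshot(self):
        self.snapshots_taken += 1
        self.applied_since_snapshot = 0
        # Drop entries covered by the snapshot, but keep a trailing window
        # so slow followers can catch up without a snapshot transfer.
        self.entries = self.entries[-self.snapshot_trailing:]

log = RaftLog(snapshot_threshold=10, snapshot_trailing=3)
for i in range(20):
    log.apply(i)
print(log.snapshots_taken, log.entries)  # 2 [17, 18, 19]
```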
This patch adds the logic that detects that a follower misses data from
a snapshot and initiates a snapshot transfer in that case. Upon receiving
the snapshot, the follower stores it locally and applies it to its state
machine. The code assumes that the snapshot already exists on the
leader.
"
This series cleans up the legacy and common sstable writer code.
metadata_collector::_ancestors were moved to class sstable
so that the former can be moved out of sstable into file_writer_impl.
Moved setting of replay position and sstable level via
sstable_writer_config so that compaction won't need to access
the metadata_collector via the sstable.
With that, metadata_collector could be moved from class sstable
to sstable_writer::writer_impl along with the column_stats.
That allowed moving "generic" file_writer methods that were actually
k/l format specific into sstable_writer_k_l.
Eventually `file_writer` code is moved into sstables/writer.cc
and sstable_writer_k_l into sstables/kl/writer.{hh,cc}
A bonus cleanup is the ability to get rid of
sstable::_correctly_serialize_non_compound_range_tombstones as
it's now available to the writers via the writer configuration
and not required to be stored in the sstable object.
Fixes #3012
Test: unit(dev)
"
* tag 'cleanup-sstable-writer-v2' of github.com:bhalevy/scylla:
sstables: move writer code away to writer.cc
sstables: move sstable_writer_k_l away to kl/writer
sstables: get rid of sstable::_correctly_serialize_non_compound_range_tombstones
sstables: move writer methods to sstable_writer_k_l
sstables: move compaction ancestors to sstable
sstables: sstable_writer: optionally set sstable level via config
sstables: sstable_writer: optionally set replay position via config
sstables: compaction: make_sstable_writer_config
sstables: open code update_stats_on_end_of_stream in sstable_writer::consume_end_of_stream
sstables: fold components_writer into sstable_writer_k_l
sstables: move sstable_writer_k_l definition upwards
sstables: components_writer: turn _index into unique_ptr
Now it's available to the writers via the writer configuration
and not required to be stored in the sstable object.
Refs #3012
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They are called solely from the sstable_writer_k_l path.
With that, move the metadata collector and column stats
to writer_impl. They are now only used by the sstable
writers.
Refs #3012
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Compaction needs access to the sstable's ancestors so we need to
keep the ancestors for the sstable separately from the metadata collector
as the latter is about to be moved to the sstable writer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use compaction::make_sstable_writer_config to pass
the compaction's `_sstable_level` to the writer
via sstable_writer_config, instead of via the sstable
metadata_collector, that is going to move from the sstable
to the writer_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use compaction::make_sstable_writer_config to pass
the compaction's replay_position (`_rp`) to the writer
via sstable_writer_config, instead of via the sstable
metadata_collector, that is going to move from the sstable
to the writer_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When Alternator is enabled over HTTPS - by setting the
"alternator_https_port" option - it needs to know some SSL-related options,
most importantly where to pick up the certificate and key.
Before this patch, we used the "server_encryption_options" option for that.
However, this was a mistake: Although it sounds like these are the "server's
options", in fact prior to Alternator this option was only used when
communicating with other servers - i.e., connections between Scylla nodes.
For CQL connections with the client, we used a different option -
"client_encryption_options".
This patch introduces a third option "alternator_encryption_options", which
controls only Alternator's HTTPS server. Making it separate from the
existing CQL "client_encryption_options" allows both Alternator and CQL to
be active at the same time but with different certificates (if the user
so wishes).
For backward compatibility, we temporarily continue to allow
server_encryption_options to control the Alternator HTTPS server if
alternator_encryption_options is not specified. However, this generates
a warning in the log, urging the user to switch. This temporary workaround
should be removed in a future version.
This patch also:
1. fixes the test run code (which has an "--https" option to test over
https) to use the new name of the option.
2. Adds documentation of the new option in alternator.md and protocols.md -
previously the information on how to control the location of the
certificate was missing from these documents.
Fixes#7204.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930123027.213587-1-nyh@scylladb.com>
The wording on how CQL with SSL is configured was ambiguous. Clarify the
text to explain that by default, it is *disabled*. We recommend to enable
it on port 9142 - but it's not a "default".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201014144938.653311-1-nyh@scylladb.com>
Consolidate the code to make the sstable_writer_config
for sstable writers into a helper method.
Following patches will add the ability to set the
replay position and sstable level via that config
structure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate consolidation of components_writer and
some sstable methods into sstable_writer_k_l.
Refs #3012.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The patch adds a generic pretty-printer for `nonwrapping_interval<>`
(and `nonwrapping_range<>`) and a specific one for `dht::ring_position`.
Adding support to clustering and partition ranges is just one more step.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201014054013.606141-1-bdenes@scylladb.com>
Fixes #7424
AWS sdk (kinesis) assumes SequenceNumbers are monotonically
growing bigints. Since we sort on and use timeuuids, a "raw" bit
representation of these will _not_ fulfill the requirement. However,
we can "unwrap" the timestamp from the uuid msb and give the value as
timestamp<<64|lsb, which will ensure sort order == bigint order.
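A sketch of the unwrapping (the bit layout follows the RFC 4122 version-1 UUID; helper names are illustrative): the 60-bit timestamp is scattered through the msb as time_low | time_mid | time_hi, so raw byte order does not sort by time, but reassembling the timestamp and emitting (timestamp << 64) | lsb does.

```python
def unwrap_sequence_number(msb, lsb):
    # Reassemble the 60-bit v1-UUID timestamp scattered through the msb.
    time_low = (msb >> 32) & 0xFFFFFFFF
    time_mid = (msb >> 16) & 0xFFFF
    time_hi = msb & 0x0FFF               # version nibble masked out
    timestamp = (time_hi << 48) | (time_mid << 32) | time_low
    return (timestamp << 64) | lsb

def make_msb(ts):
    # Pack a timestamp the way RFC 4122 v1 lays it out (version = 1).
    return (((ts & 0xFFFFFFFF) << 32)        # time_low
            | (((ts >> 32) & 0xFFFF) << 16)  # time_mid
            | (1 << 12)                      # version
            | ((ts >> 48) & 0x0FFF))         # time_hi

earlier, later = make_msb(2), make_msb(1 << 32)
print(earlier < later)   # False: raw msb order disagrees with time order
print(unwrap_sequence_number(earlier, 0) < unwrap_sequence_number(later, 0))  # True
```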
Fixes #7409
AWS kinesis Java sdk requires/expects shards to be reported in
lexical order and, even worse, ignores lastevalshard. Thus not
upholding said order will break their stream introspection badly.
Added asserts to unit tests.
v2:
* Added more comments
* use unsigned_cmp
* unconditional check in streams_test
* seastar 35c255dcd...6973080cd (2):
> Merge "memory: improve memory diagnostics dumped on allocation failures" from Botond
> map_reduce: use get0 rather than get
Tests that the exceptional future returned by the serialized action
is propagated to trigger, reproducing #7352.
The test fails without the previous patch:
"serialized_action: trigger: include also semaphore status to promise"
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the serialized_action error is set to a shared_promise,
but is not returned to the caller, unless there is an
already outstanding action.
Note that setting the exception on the promise when no one
collected it via the shared_future caused an 'Exceptional future ignored'
warning to be issued, as seen in #7352.
Fixes #7352
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, if `with_semaphore` returns exceptional future, it is not
propagated to the promise, and other waiters that got a shared
future will not see that.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The reader concurrency semaphore timing out or its queue overflowing
are fairly common events, both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate, and the investigation is lengthy:
it involves looking at metrics, coredumps, or both.
This patch intends to jumpstart this process by dumping diagnostics on
semaphore timeout or queue overflow. The diagnostics are printed to the
log with debug level to avoid excessive spamming. It contains a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.
Example:
DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory count name
3499M 27 ks.test:data-query
3499M 27 total
Permits with state waiting, sorted by count
count memory name
1 0B ks.test:drain
7650 0B ks.test:data-query
7651 0B total
Permits with state registered, sorted by count
count memory name
0 0B total
Total: permits: 7678, memory: 3499M
This allows determining several things at a glance:
* What are the tables involved
* What are the operations involved
* Where is the memory
This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.
Tests: unit(dev, debug)
"
* 'dump-diagnostics-on-semaphore-timeout/v2' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: dump permit diagnostics on timeout or queue overflow
utils: add to_hr_size()
reader_concurrency_semaphore: link permits into an intrusive list
reader_concurrency_semaphore: move expiry_handler::operator()() out-of-line
reader_concurrency_semaphore: move constructors out-of-line
reader_concurrency_semaphore: add state to permits
reader_concurrency_semaphore: name permits
querier_cache_test: test_immediate_evict_on_insert: use two permits
multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader()
multishard_combining_reader: add permit parameter
multishard_combining_reader: shard_reader: use multishard reader's permit
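The grouping behind the dump can be sketched as follows (structure and output format are illustrative, not the actual Scylla code): permits are bucketed by state and by "table:operation" name, producing the per-state histogram shown in the example log above.

```python
from collections import defaultdict

def permit_diagnostics(permits):
    """permits: iterable of (table, operation, state, memory_bytes)."""
    by_state = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for table, op, state, mem in permits:
        entry = by_state[state][f"{table}:{op}"]
        entry[0] += 1                    # permit count
        entry[1] += mem                  # memory held by these permits
    lines = []
    for state, names in sorted(by_state.items()):
        lines.append(f"Permits with state {state}")
        for name, (count, mem) in sorted(names.items()):
            lines.append(f"  count={count} memory={mem}B {name}")
    return "\n".join(lines)

print(permit_diagnostics([
    ("ks.test", "data-query", "admitted", 129 << 20),
    ("ks.test", "data-query", "waiting", 0),
    ("ks.test", "drain", "waiting", 0),
]))
```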
This utility function converts a potentially large number to a compact
representation, composed of at most 4 digits and a letter indicating
the power of two the number has to be multiplied by to arrive at the
original number (with some loss of precision).
The different powers of two are the conventional 2 ** (N * 10) variants:
* N=0: (B)ytes
* N=1: (K)bytes
* N=2: (M)bytes
* N=3: (G)bytes
* N=4: (T)bytes
Examples:
* 87665 will be converted to 87K
* 1024 will be converted to 1K
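A sketch of such a formatter, assuming floor division (which yields 85K for 87665; the real helper's exact rounding may differ):

```python
def to_hr_size(n):
    # Divide by 1024 until the value fits, then append the unit letter.
    units = "BKMGT"
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n}{units[i]}"

print(to_hr_size(1024))     # 1K
print(to_hr_size(87665))    # 85K (floor; actual rounding may differ)
print(to_hr_size(3 << 30))  # 3G
```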
Instead of a simple boolean, designating whether the permit was already
admitted or not, add a proper state field with a value for all the
different states the permit can be in. Currently there are three such
states:
* registered - the permit was created and started accounting resource
consumption.
* waiting - the permit was queued to wait for admission.
* admitted - the permit was successfully admitted.
The state will be used for debugging purposes, both during coredump
debugging as well as for dumping diagnostics data about permits.
Require a schema and an operation name to be given to each permit when
created. The schema is of the table the read is executed against, and
the operation name, which is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kinds of
reads, but different code paths the read took.
As not all reads can be associated with one schema, the schema is allowed
to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
scylla-python3 causes a segfault when a non-default locale is specified.
As a workaround for this, we need to set LC_ALL=en_US.UTF-8 in the python3 thunk.
Fixes #7408
Closes #7414
Make the warning message clearer:
* Include the number of partitions affected by the batch.
* Be clear that the warning is about the batch size in bytes.
Fixes#7367
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Closes #7417
Without this, clang complains that we violate argument dependent
lookup rules:
note: 'operator<<' should be declared prior to the call site or in namespace 'sstables'
std::ostream& operator<<(std::ostream&, const sstables::component_type&);
we can't enforce the #include order, but we can easily move it
to namespace sstables (where it belongs anyway), so let's do that.
gcc is happy either way.
Closes #7413
bound_kind_m's formatter violates argument dependent lookup rules
according to clang, so fix that. Along the way improve the formatter a
little.
Closes #7412
* git://github.com/avikivity/scylla.git avikivity-bound_kind_m-formatter:
sstables: move bound_kind_m formatter to namespace sstables
sstables: move bound_kind_m formatter to its natural place
sstables: deinline bound_kind_m formatter
Without this, clang complains that we violate argument dependent
lookup rules:
note: 'operator<<' should be declared prior to the call site or in namespace 'sstables'
std::ostream& operator<<(std::ostream&, const sstables::bound_kind_m&);
we can't enforce the #include order, but we can easily move it
to namespace sstables (where it belongs anyway), so let's do that.
gcc is happy either way.
Move bound_kind_m's formatter to the same header file where
is is defined. This prevents cases where the compiler decays
the type (an enum) to the underlying integral type because it
does not see the formatter declaration, resulting in the wrong
output.
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:
- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
this operation after sleeping 1 second.
This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.
This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.
First observed in #6964
Tests:
- unit(dev)
- hinted handoff dtests
Closes #7407
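The shape of the fix can be sketched as follows (class and method names are hypothetical, not Scylla's): the can_send() check moves in front of the segment read, so a DOWN endpoint no longer costs a full file read on every retry.

```python
import time

class HintManager:
    # Illustrative stand-in for end_point_hints_manager's per-endpoint state.
    def __init__(self, endpoint_up):
        self.endpoint_up = endpoint_up
        self.retry_delay = 0             # no real sleeping in this sketch
        self.reads = 0                   # counts costly segment reads

    def can_send(self):
        return self.endpoint_up

    def read_first_segment(self):
        self.reads += 1
        return ["hint1", "hint2"]

    def send(self, hint):
        pass

def send_hints_loop(manager, max_iterations=3):
    for _ in range(max_iterations):
        if not manager.can_send():       # the added early check
            time.sleep(manager.retry_delay)
            continue
        for hint in manager.read_first_segment():
            manager.send(hint)
        return True
    return False

down = HintManager(endpoint_up=False)
send_hints_loop(down)
print(down.reads)  # 0: the segment is never read while the endpoint is DOWN
```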
The test currently uses a single permit shared between two simulated
reads (to wait admission twice). This is not a supported way of using a
permit and will stop working soon as we make the states the permit is in
more pronounced.
Allow the evictable reader managing the underlying reader to pass its
own permit to it when creating it, making sure they share the same
permit. Note that the two parts can still end up using different
permits, when the underlying reader is kept alive between two pages of a
paged read and thus keeps using the permit received on the previous
page.
Also adjust the `reader_context` in multishard_mutation_query.cc to use
the passed-in permit instead of creating a new one when creating a new
reader.
Don't create an own permit, take one as a parameter, like all other
readers do, so the permit can be provided by the higher layer, making
sure all parts of the logical read use the same permit.
Don't create a new permit per shard reader, pass down the multishard
reader's one to be used by each shard reader. They all belong to the
same read, they should use the same permit. Note that despite its name
the shard readers are the local representation of a reader living on a
remote shard and as such they live on the same shard the multishard
combining reader lives on.
Clang parses templates more eagerly than gcc, so it fails on
some forward-declared templates. In this case, value_writer
was forward-declared and then used in data::cell. As it also
uses some definitions local to data::cell, it cannot be
defined before it as well as after it.
To solve the problem, we define it as a nested class so
it can use other local definitions, yet be defined before it
is used. No code changes.
Closes #7401
A few source files (like those generated by antlr) don't build with seastar,
and so don't inherit all of its flags. They then use the compiler default dialect,
not C++20. With gcc that's just fine, since gcc supports concepts in earlier dialects,
but clang requires C++20.
Fix by forcing --std=gnu++20 for all files (same as what Seastar chooses).
Closes #7392
The remains of the defunct #7246.
Fixes #7344
Fixes #7345
Fixes #7346
Fixes #7347
Shard ID length is now within limits.
Shard end sequence number should be set when appropriate.
Shard parent is selected a bit more carefully (sorting)
Shards are filtered by time to exclude cdc generations we cannot get data from (too old)
Shard paging improved
Closes #7348
* github.com:scylladb/scylla:
test_streams: Add some more sanity asserts
alternator::streams: Set dynamodb data TTL explicitly in cdc options
alternator::streams: Improve paging and fix parent-child calculation
alternator::streams: Remove table from shard_id
alternator::streams: Filter out cdc streams older than data/table
alternator::error: Add a few dynamo exception types
"
max_concurrent_for_each was added to seastar for replacing
sstable_directory::parallel_for_each_restricted by using
more efficient concurrency control that doesn't create
an unlimited number of continuations.
The series replaces the use of sstable_directory::parallel_for_each_restricted
with max_concurrent_for_each and exposes the sstable_directory::do_for_each_sstable
via a static method.
This method is used here by table::snapshot to limit the concurrency
of snapshot operations, which suffer from the same unbounded
concurrency problem that sstable_directory solved.
In addition sstable_directory::_load_semaphore that was used
across calls to do_for_each_sstable was replaced by a static per-shard
semaphore that caps concurrency across all calls to `do_for_each_sstable`
on that shard. This makes sense since the disk is a shared resource.
In the future, we may want to have a load semaphore per device rather than
a single global one. We should experiment with that.
Test: unit(dev)
"
* tag 'max_concurrent_for_each-v5' of github.com:bhalevy/scylla:
table: snapshot: use max_concurrent_for_each
sstable_directory: use an external load_semaphore
test: sstable_directory_test: extract sstable_directory creation into with_sstable_directory
distributed_loader: process_upload_dir: use initial_sstable_loading_concurrency
sstables: sstable_directory: use max_concurrent_for_each
The build with clang fails with
ld.lld: error: undefined symbol: icu_65::Collator::createInstance(icu_65::Locale const&, UErrorCode&)
>>> referenced by like_matcher.cc
>>> build/dev/utils/like_matcher.o:(boost::re_detail_106900::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_65::Locale const&))
>>> referenced by like_matcher.cc
>>> build/dev/utils/like_matcher.o:(boost::re_detail_106900::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_65::Locale const&))
That symbol lives in libicui18n. It's not clear why clang fails to resolve it and gcc succeeds (after all,
both use lld as the linker) but it is easier to add the library than to attempt to figure out the
discrepancy.
Closes#7391
Clang has some difficulty with the boost::cpp_int constructor from string_view.
In fact it is a mess of enable_if<>s so a human would have trouble too.
Work around it by converting to std::string. This is bad for performance, but
this constructor is not going to be fast in any case.
Hopefully a fix will arrive in clang or boost.
Closes#7389
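The workaround pattern can be sketched with a stand-in parser (the names below are hypothetical; the actual code constructs boost::multiprecision::cpp_int): materialize the string_view as a std::string before handing it to a constructor overload set that cannot take a string_view. The extra copy is the performance cost mentioned above.

```cpp
#include <string>
#include <string_view>

// Sketch of the workaround: convert the string_view explicitly, accepting
// an extra allocation, rather than fight the enable_if<> overload set.
long parse_number(std::string_view sv) {
    return std::stol(std::string(sv));   // explicit conversion, extra copy
}
```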
Following ad48d8b43c, fix a similar problem which popped up
with higher inlining thresholds in query-result-writer.hh. Since
idl/query depends on idl/keys, it must follow in definition order.
Closes#7384
We add std::distance(...) + 1 to a vector iterator, but
the vector can be empty, so we're adding a non-zero value
to nullptr, which is undefined behavior.
Rearrange to perform the limit (std::min()) before adding
to the pointer.
Found by clang's ubsan.
Closes#7377
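The rearrangement can be sketched with a hypothetical helper (not the actual Scylla code): clamp the element count with std::min() before any iterator arithmetic, so begin() + n is never evaluated with n > 0 on an empty vector, whose begin() may be a null pointer.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Clamp first, then advance: null + 0 is fine, null + nonzero is UB.
std::size_t safe_take(const std::vector<int>& v, std::size_t requested) {
    std::size_t n = std::min(requested, v.size());   // limit before adding
    auto it = v.begin() + n;                         // safe: n <= size()
    return static_cast<std::size_t>(std::distance(v.begin(), it));
}
```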
database_test contains several instances of calling do_with_cql_test_env()
with a function that expects to be called in a thread. This mostly works
because there is an internal thread in do_with_cql_test_env(), but is not
guaranteed to.
Fix by switching to the more appropriate do_with_cql_test_env_thread().
Closes#7333
The variable 'observer' (an std::optional) may be left uninitialized
if 'incremental_enabled' is false. However, it is used afterwards
with a call to disconnect, accessing garbage.
Fix by accessing it via the optional wrapper. A call to optional::reset()
destroys the observable, which in turn calls disconnect().
Closes#7380
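A hypothetical model of the fix: hold the observer in a std::optional and tear it down via optional::reset(). The destructor performs the disconnect, and reset() is safe even when the optional was never engaged (the incremental_enabled == false case).

```cpp
#include <optional>

// Stand-in observer type; disconnect is modelled by the destructor.
struct observer {
    explicit observer(bool* d) : disconnected(d) {}
    ~observer() { *disconnected = true; }  // stands in for disconnect()
    bool* disconnected;
};

void stop(std::optional<observer>& obs) {
    obs.reset();   // no-op if disengaged, destroys (disconnects) otherwise
}
```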
get_nt_build_id() constructs a pointer by adding a base and an
offset, but if the base happens to be zero, that is undefined
under C++ rules (although legal in ELF).
Fix by performing the addition on integers, and only then
casting to a pointer.
Closes#7379
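The pattern can be sketched with a hypothetical helper: perform the base + offset addition on integers, where base == 0 is perfectly defined, and only then cast the sum to a pointer. Forming (char*)base + offset is undefined in C++ when base is a null pointer, even though the resulting ELF addresses are legal.

```cpp
#include <cstdint>

// Integer addition first, pointer cast last: well defined even for base 0.
const void* at_offset(std::uintptr_t base, std::uintptr_t offset) {
    return reinterpret_cast<const void*>(base + offset);
}
```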
libstdc++'s std::regex uses recursion[1], with a depth controlled by the
input. Together with clang's debug mode, this overflows the stack.
Use boost::regex instead, which is immune to the problem.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164Closes#7378
b1e78313fe added a check for ubsan to squelch a false positive,
but that check doesn't work with clang. Relax it to check for debug
mode, so clang doesn't hit the same false positive as gcc did.
Define a SANITIZE macro so we have a reliable way to detect if
we're running with a sanitizer.
Closes#7372
d2t() scales a fraction in the range [0, 1] to the range of
a biased token (same as unsigned long). But x86 doesn't support
conversion to unsigned, only signed, so this is a truncating
conversion. Clang's ubsan correctly warns about it.
Fix by reducing the range before converting, and expanding it
afterwards.
Closes#7376
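A hedged model of the range-reduction trick (not the exact d2t code): x86 can only convert a double to a signed integer, so map the fraction onto the signed 64-bit range first, convert, then re-bias into unsigned space.

```cpp
#include <cstdint>

// [0, 1] -> full uint64_t range, without a double -> unsigned conversion.
std::uint64_t fraction_to_token(double d) {
    double scaled = d * 0x1p64 - 0x1p63;        // [0, 1] -> [-2^63, 2^63]
    if (scaled >= 0x1p63) {
        scaled = 0x1p63 - 0x1p10;               // clamp the d == 1.0 edge
    }
    std::int64_t s = static_cast<std::int64_t>(scaled);  // signed: defined
    return static_cast<std::uint64_t>(s) + (std::uint64_t{1} << 63);
}
```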
Some tests (run_based_compaction_test at least) use _max_sstable_size = 0
in order to force one partition per sstable. That triggers an overflow
when calculating the expected bloom filter size. The overflow doesn't
matter for normal operation, because the result later appears as a
divisor, but does trigger a ubsan error.
Squelch the error by not dividing by zero here.
I tried using _max_sstable_size = 1, but the test failed for other
reasons.
Closes#7375
When converting a value to its Lua representation, we choose
an integer type if it fits. If it doesn't, we fall back to a
more expensive type. So we explicitly try to trigger an overflow.
However, clang's ubsan doesn't like the overflow, and kills the
test. Tell it that the overflow is expected here.
Closes#7374
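One UBSan-clean way to implement "choose an integer type if it fits" is to check the bounds before ever converting, instead of converting and looking for the overflow; this is a hedged sketch of that idea, not necessarily the actual fix applied here.

```cpp
#include <cmath>

// 0x1p63 (2^63) is exactly representable as a double; INT64_MAX is not,
// so the half-open comparison below is exact.
bool fits_int64(double d) {
    return d >= -0x1p63 && d < 0x1p63 && d == std::floor(d);
}
```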
Our AVX2 implementation cannot load a partial vector,
or mask unused elements (that can be done with AVX-512/SVE2),
so it has some restrictions. Document them.
Closes#7385
Offsetting a null pointer is undefined, and clang's ubsan complains.
Rearrange the arithmetic so we never offset a null pointer. A function
is introduced for the remaining contiguous bytes so it can cast the result
to size_t, avoiding a compare-of-different-signedness warning from gcc.
Closes#7373
When the number of nanosecond digits is greater than 9, the std::pow()
expression that corrects the nanosecond value becomes infinite. This
is because sstring::length() is unsigned, and so negative values
underflow and become large.
Following Cassandra, fix by forbidding more than 9 digits of
nanosecond precision.
Found by clang's ubsan.
Closes#7371
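The guard can be sketched with a hypothetical parser: with an unsigned length, the expression 9 - length silently wraps to a huge value once length > 9, which is what made the std::pow() result infinite. Rejecting more than 9 digits up front, as Cassandra does, removes the underflow entirely.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <string>

// Parse a fractional-second digit string into nanoseconds.
std::optional<std::int64_t> parse_nanos(const std::string& digits) {
    if (digits.empty() || digits.length() > 9) {
        return std::nullopt;   // 9 - digits.length() would wrap around
    }
    std::int64_t value = std::stoll(digits);
    // Scale up to 9 fractional digits: "123" -> 123'000'000 nanoseconds.
    return value * static_cast<std::int64_t>(
        std::pow(10.0, 9 - digits.length()));
}
```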
Following 5d249a8e27, apply the same
fix for user_type_impl.
This works around https://bugs.llvm.org/show_bug.cgi?id=47747
Depending on this might be unstable, as the bug can show up at any
corner, but this is sufficient right now to get
test_user_function_disabled to pass.
Closes#7370
Our cql parser uses large amounts of stack, and can overflow it
in debug mode with clang. To prevent this stack overflow,
temporarily use a larger (1MB) stack.
Closes#7369
* github.com:scylladb/scylla:
cql3: use larger stack for do_with_cql_parser() in debug mode
cql3: deinline do_with_cql_parser()
Clang has a bug processing inline ifuncs with intrinsics[1].
Since ifuncs can't be inlined anyway (they are always dispatched
via a function pointer that is determined based on the CPU
features present), nothing is gained by inlining them. Deinlining
therefore reduces compile time and works around the clang bug.
[1] https://bugs.llvm.org/show_bug.cgi?id=47691
Closes#7358
Our cql parser uses large amounts of stack, and can overflow it
in debug mode with clang. To prevent this stack overflow,
temporarily use a larger (1MB) stack.
We can't use seastar::thread(), since do_with_cql_parser() does
not yield. We can't use std::thread(), since lw_shared_ptr()'s
debug mode will scream murder at an lw_shared_ptr used across
threads (even though it's perfectly safe in this case). We
can't use boost::context2 since that requires the library to
be compiled with address sanitizer support, which it isn't on
Fedora. So we use a fiber switch using the getcontext() function
family. This requires extra annotations for debug mode, which are
added.
The cql parser causes trouble with the sanitizers and clang,
since it consumes a large amount of stack space (it does so
with gcc too, but does not overflow our 128k stacks). In
preparation for working around the problem, deinline it
so the hacks need not spread to the entire code base
via #include.
There is no performance impact from the virtual function,
as cql parsing will dominate the call.
Raft tests with declarative structure instead of procedural.
* https://github.com/alecco/scylla/tree/raft-ale-tests-03d:
raft: log failed test case name
raft: test add hasher
raft: declarative tests
raft: test make app return proper exit int value
raft: test add support for disconnected server
raft: tests use custom server ids for easier debugging
raft: make election_elapsed public for testing
raft: test remove unnecessary header
raft: fix typo snaphot snapshot
Values seen by nodes were so far summed, but this does not guarantee
that the order of these values was respected.
Use a digest to check the output, implicitly checking order.
On the other hand, a sum or a simple positional checksum like Fletcher's
is easier to debug, as the rolling sum is evident.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For convenience making Raft tests, use declarative structures.
Servers are set up and initialized and then updates are processed.
For now, updates are just adding entries to leader and change of leader.
Updates and leader changes can be specified to run after initial test setup.
An example test for 3 nodes, node 0 starting as leader having two entries
0 and 1 for term 1, and with current term 2, then adding 12 entries,
changing leader to node 1, and adding 12 more entries. The test will
automatically add more entries to the last leader until the test limit
of total_values (default 100).
{.name = "test_name", .nodes = 3, .initial_term = 2,
 .initial_states = {{.le = {{1,0},{1,1}}}},
 .updates = {entries{12},new_leader{1},entries{12}}},
Leader is isolated before change via is_leader returning false.
Initial leader (default server 0) will be set with this method, too.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The get_range_to_endpoint_map method takes a keyspace and returns a map
between the token ranges and the endpoints.
It is used by some external tools for repair.
Token ranges are encoded as size-2 arrays; if the start or end is empty, it
will be added as an empty string.
The implementation uses get_range_to_address_map and re-packs it
accordingly.
The use of stream_range_as_array is to reduce the risk of large
allocations and stalls.
Relates to scylladb/scylla-jmx#36
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#7329
Tables may have thousands of sstables and a number of component files
for each sstable. Using parallel_for_each on all sstables (and
parallel_for_each in sstables::create_links for each file)
needlessly overloads the system with an unbounded number of continuations.
Use max_concurrent_for_each and acquire the db sst_dir_semaphore to
limit parallelism.
Note that although snapshot is called after scylla has already
loaded the sstables, we use the configured initial_sstable_loading_concurrency().
As a future follow-up we may want to define yet another config
variable for on-going operations on sstable directories
if we see that it warrants a different setting than the initial
loading concurrency.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although each sstable_directory limits concurrency using
max_concurrent_for_each, there could be a large number
of calls to do_for_each_sstable running in parallel
(e.g per keyspace X per table in the distributed_loader).
To cap parallelism across sstable_directory instances and
concurrent calls to do_for_each_sstable, start a sharded<semaphore>
and pass a shared semaphore& to the sstable_directory:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use common code to create, start, and stop the sharded<sstable_directory>
for each test.
This will be used in the next patch for creating a sharded semaphore
and passing it to the sstable_directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
shutil.rmtree(ignore_errors=True) was used to ignore the error when the
directory does not exist, but it also ignores permission errors, so we
shouldn't use it. Run os.path.exists() before shutil.rmtree() instead.
Fixes#7337
Closes#7338
This series prepares the build system for clang support. It deals with
the different sets of warnings accepted by clang and gcc, and with
detecting clang 10 as a supported compiler.
It's still not possible to build with clang after this, but we're
another step closer.
Closes#7269
* github.com:scylladb/scylla:
build: detect and allow clang 10 as a compiler
build: detect availability of -Wstack-usage=
build: disable many clang-specific warnings
rapidjson has a harmless (but true) ubsan violation. It was fixed
in 16872af889.
Since rapidjson hasn't had a release since 2016, we're unlikely to see
the fix, so suppress it to prevent the tests from failing. In any case
the violation is harmless.
gcc's ubsan doesn't object to the addition.
Closes#7357
Send more than one entry in a single append_entry message, but
limit each packet's size according to the append_request_threshold parameter.
Message-Id: <20201007142602.GA2496906@scylladb.com>
Updated for the ability to add group names to SMP service groups
(https://github.com/scylladb/seastar/pull/809).
* seastar 8c8fd3ed...c62c4a3d (3):
> smp service group: add optional group name
> dpdk: mark link_ready() function override
> Merge "sharded: make start, stop, and invoke_on methods noexcept" from Benny
Fix a race condition in on_compaction_completion() that can prevent shutdown,
as well as an exception handling error. See individual patches for details.
Fixes#7331.
Closes#7334
* github.com:scylladb/scylla:
table: fix mishandled _sstable_deleted_gate exception in on_compaction_completion
table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
Although process_upload_dir is not called when initially loading
the tables, but rather from storage_service::load_new_sstables,
it can use the same sstable_loading_concurrency, rather than the constant `4`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use max_concurrent_for_each instead of parallel_for_each in
sstable_directory::parallel_for_each_restricted to avoid
creating potentially thousands of continuations,
one for each sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#7345
Fixes#7346
Do a more efficient collection skip when doing paging, instead of
iterating the full sets.
Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token-order sorting and finding the
apparent previous ID covering the approximate range of the new gen.
Fix end-sequence-number generation by looking at whether we are the
last gen or not, instead of the (not filled in) 'expired' column.
Fixes#7344
The table is not really needed, as shard_id:s are not required
to be unique across streams, and also because of the length limit
on the shard_id text representation.
As a side effect, the shard iterator carries the stream arn instead.
Instead of eagerly linearizing all values as they are passed to
validate(), defer linearization to those validators that actually need
linearized values. Linearizing large values puts pressure on the memory
allocator with large contiguous allocation requests. This is something
we are trying to actively avoid, especially if it is not really needed.
It turns out the types whose validators really need linearized values are
a minority, as most validators just look at the size of the value, and
some, like bytes, don't need validation at all, while usually having large
values.
This is achieved by templating the validator struct on the view and
using the FragmentedRange concept to treat all passed in views
(`bytes_view` and `fragmented_temporary_buffer_view`) uniformly.
This patch makes no attempt at converting existing validators to work
with fragmented buffers, only trivial cases are converted. The major
offenders still left are ascii/utf8 and collections.
Fixes: #7318
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201007054524.909420-1-bdenes@scylladb.com>
After adding a new node to the cluster, Scylla sends a NEW_NODE event
to CQL clients. Some clients immediately try to connect to the new node,
however it fails as the node has not yet started listening to CQL
requests.
In contrast, Apache Cassandra waits for the new node to start its CQL
server before sending NEW_NODE event. In practice this means that
NEW_NODE and UP events will be sent "jointly" after new node is UP.
This change is implemented in the same manner as in Apache Cassandra
code.
Fixes#7301.
Closes#7306
Fixes#7347
If cdc stream id:s are older than either the table creation or now - 24h
we can skip them in describe_stream, to minimize the number of
shards being returned.
The username becomes known in the course of resolving challenges
from `PasswordAuthenticator`. That's why username is being set on
successful authentication; until then all users are "anonymous".
Meanwhile, `AllowAllAuthenticator` (the default) does not request a
username, so users logged in with it will remain as "anonymous" in
`system.clients`.
Shuffling of code was necessary to unify existing infrastructure
for INSERTing entries into `system.clients` with later UPDATEs.
"
There are a few places left that call for the global query processor
instance; tracing is one of them.
The query processor is used mainly in table_helper, so this set mostly
shuffles its methods' arguments to deliver the needed reference. At the
end, the main.cc code is patched to provide the query processor, which
is still global and not stopped, and is thus safe to be used anywhere.
tests: unit(dev), dtest(cql_tracing:dev)
"
* 'br-tracing-vs-query-processor' of https://github.com/xemul/scylla:
tracing: Keep qp anchor on backend
tracing: Push query processor through init methods
main: Start tracing in main
table_helper: Require local query processor in calls
table_helper: Use local qp as setup_table argument
table_helper: Use local db variable
"
The querier cache has a memory based eviction mechanism, which starts
evicting freshly inserted queriers once their collective memory
consumption goes above the configured limit. For determining the memory
consumption of individual queriers, the querier cache uses
`flat_mutation_reader::buffer_size()`. But we now have a much more
comprehensive accounting of the memory used by queriers: the reader
permit, which also happens to be available in each querier. So use this
to determine the querier's memory consumption instead.
Tests: unit(dev)
"
* 'querier-cache-use-permit-for-memory-accounting/v1' of https://github.com/denesb/scylla:
flat_mutation_reader: de-virtualize buffer_size()
querier_cache: use the reader permit for memory accounting
querier_cache_test: use local semaphore not the test global one
reader_permit: add consumed_resources() accessor
The query processor is required in table_helper's used by tracing. Now
everything is ready to push the query processor reference from main down
to the table helpers.
Because of the current initialization sequence it's only possible to have
the started query processor at the .start_tracing() time. Earlier, when
the sharded<tracing> is started the query processor is not yet started,
so tracing keeps a pointer on local query processor.
When tracing is stopped, the pointer is nulled. This is safe (but an
assert is put when dereferencing it), because on stop the trace writes'
gate is closed and the query processor is only used in them.
Also there's still a chance that tracing remains started in case of start
abort, but this is on-par with the current code -- sharded query processor
is not stopped, so the memory is not freed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make tracing keyspace helper reference query processor, so this
patch adds the needed arguments through the initialization stack.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the tracing::start_tracing() out of the storage_service::join_cluster. It
anyway happens at the end of the join, so the logic is not changed, but it
becomes possible to patch tracing further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keeping the query processor reference on the table_helper in RAII manner
seems wasteful; the only user of it -- the trace_keyspace_helper -- has
a bunch of helpers on board, each of which would then keep its own copy
for no gain.
At the same time the trace_keyspace_helper already gets the query processor
for its needs, so it can share one with table_helper-s.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the table_helper API require the query_processor
reference and use it where needed. The .setup_table() is a private
method, and still grabs the query processor reference itself. Since
its futures do not reshard, it's safe to carry the query processor
reference through.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test was broken by the recent sstables manager rework. In the middle,
the sstables::test_env is destroyed without being closed, which leads
to a broken _closing assertion inside ~sstables_manager().
The fix is to use the test_env::do_with helper.
tests: perf.memory_footprint
* https://github.com/xemul/scylla/tree/br-memory-footprint-test-fix:
test/perf/memory_footprint: Fix indentation after previous patch
test/perf/memory_footprint: Don't forget to close sstables::test_env after usage
After the recent sstables manager rework the sstables::test_env must be
.close()d after usage, otherwise ~sstables_manager() hits the _closing
assertion.
Do it with the help of .do_with(). The execution context is already
seastar::async in this place, so .get() it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some cases a collection is used to keep several elements,
so it's good to know this timing.
For example, a mutation_partition keeps a set of rows; if used
in cache it can grow large, while if used in a mutation to apply,
it's typically small. Plain replacement of the bst with a b-tree caused
performance degradation of mutation application because a b-tree
is only better at big sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This collection is widely used, any replacement should be
compared against it to better understand pros-n-cons.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
on_compaction_completion tries to handle a gate_closed_exception, but
with_gate() throws rather than creating an exceptional future, so
the extra handling is lost. This is relatively benign since it will
just fail the compaction, requiring that work to be redone later.
Fix by using the safer try_with_gate().
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:
1. copy _sstables_compacted_but_not_deleted to a temporary
2. update temporary
3. do dangerous stuff
4. move temporary to _sstables_compacted_but_not_deleted
This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.
Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.
Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.
Fixes#7331.
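The fix pattern can be modelled in a few lines (hypothetical names, and a minimal scope guard standing in for seastar::defer(); this is a sketch, not the table.cc code): the shared set is updated in place, and a fixup including the rebuild_statistics() stand-in runs unconditionally, so two racing invocations can no longer clobber each other through stale copies.

```cpp
#include <set>
#include <utility>

// Minimal scope guard: runs the stored callable on scope exit,
// whether the scope is left normally or by an exception.
template <typename Func>
struct deferred {
    Func f;
    ~deferred() { f(); }
};

template <typename Func>
deferred<Func> defer(Func f) { return deferred<Func>{std::move(f)}; }

void compact_one(std::set<int>& compacted_but_not_deleted, int sst,
                 int& rebuilds) {
    compacted_but_not_deleted.insert(sst);     // update in place, no copy
    auto fixup = defer([&] {
        compacted_but_not_deleted.erase(sst);  // runs on success or failure
        ++rebuilds;                            // rebuild_statistics() stand-in
    });
    // ... the "dangerous stuff" that may yield or throw would go here ...
}
```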
The main user of this method, the one which required this method to
return the collective buffer size of the entire reader tree, is now
gone. The remaining two users just use it to check the size of the
reader instance they are working with.
So de-virtualize this method and reduce its responsibility to just
returning the buffer size of the current reader instance.
The querier cache has a memory limit it enforces on cached queriers. For
determining how much memory each querier uses, it currently uses
`flat_mutation_reader::buffer_size()`. However, we now have a much more
complete accounting of the memory each read consumes, in the form of the
reader permit, which also happens to be handy in the queriers. So use it
instead of the not very well maintained `buffer_size()`.
In the mutation source, which creates the reader for this test, the
global test semaphore's permit was passed to the created reader
(`tests::make_permit()`). This caused reader resources to be accounted
on the global test semaphore, instead of the local one the test creates.
Just forward the permit passed to the mutation sources to the reader to
fix this.
Merged patch series by Tomasz Grabiec:
UBSAN complains in debug mode when the counter value overflows:
counters.hh:184:16: runtime error: signed integer overflow: 1 + 9223372036854775807 cannot be represented in type 'long int'
Aborting on shard 0.
Overflow is supposed to be supported. Let's silence it by using casts.
Fixes#7330.
Tests:
- build/debug/test/tools/cql_repl --input test/cql/counters_test.cql
Tomasz Grabiec (2):
counters: Avoid signed integer overflow
test: cql: counters: Add tests reproducing signed integer overflow in
debug mode
counters.hh | 2 +-
test/cql/counters_test.cql | 9 ++++++++
test/cql/counters_test.result | 48 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 58 insertions(+), 1 deletion(-)
UBSAN complains in debug mode when the counter value overflows:
counters.hh:184:16: runtime error: signed integer overflow: 1 + 9223372036854775807 cannot be represented in type 'long int'
Aborting on shard 0.
Overflow is supposed to be supported.
Let's silence it by using casts.
Fixes#7330.
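The cast trick looks like this in isolation: signed overflow is undefined behaviour, but unsigned arithmetic wraps by definition, so the addition is performed in uint64_t and converted back (a modular conversion, specified since C++20 and what every mainstream compiler did before).

```cpp
#include <cstdint>

// Wrap-around addition of counter shards without signed-overflow UB.
std::int64_t counter_add(std::int64_t a, std::int64_t b) {
    return static_cast<std::int64_t>(
        static_cast<std::uint64_t>(a) + static_cast<std::uint64_t>(b));
}
```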
Server address UUID 0 is not a valid server id, since there is code that
assumes that if server_id is 0 the value is not set (e.g. _voted_for).
Prevent users from manually setting this invalid value.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When the AppendEntries/InstallSnapshot term is the same as the current
server's, and the current server's leader is not set, we should
assign it to avoid starting an election if the current leader becomes idle.
Restructure the code accordingly - change candidate state to Follower
upon InstallSnapshot.
It was erroneously replaced by time-based logic, which caused us
to send one probe per tick, which is not the intention at all. There can
be one outstanding probe message, but the moment it gets a reply the next
one should be sent without waiting for a tick.
"
Misc. cleanups and minor optimizations of token_metadata methods
in preparation for futurizing parts of the api around
update_pending_ranges and abstract_replication_strategy::calculate_natural_endpoints,
to prevent reactor stalls on these paths
Test: unit(dev)
"
* 'token_metadata_cleanup' of github.com:bhalevy/scylla:
token_metadata: get rid of unused calculate_pending_ranges_for_* methods
token_metadata: get rid of clone_after_all_settled
token_metadata_impl: remove_endpoint: do not sort tokens
token_metadata_impl: always sort_tokens in place
On some environments such as CentOS 8, journalctl --user -xe does not work
since journald is running in volatile mode.
The issue cannot be fixed in non-root mode; as a workaround we should log to
a file instead of the journal.
Also added scylla_logrotate to ExecStartPre, which renames the previous log
file, since StandardOutput=file:/path/to/file will erase the existing file
when the service is restarted.
Fixes#7131
Closes#7326
Instead of relying on SCYLLA-VERSION-GEN to be consistently
updated in each submodule, propagate the top-level product-version-release
to all submodules. This reduces the churn required for each release,
and makes the release strings consistent (previously, the git hash in each
was different).
Closes#7268
Alternator does not yet support direct access to nested attributes in
expressions (this is issue #5024). But it's still good to have tests
covering this feature, to make it easier to check the implementation
of this feature when it comes.
Until now we did not have tests for using nested attributes in
*FilterExpression*. This patch adds a test for the straightforward case,
and also adds tests for the more elaborate combination of FilterExpression
and ProjectionExpression. This combination - see issue #6951 - means that
some attributes need to be retrieved despite not being projected (because
they are needed in a filter). When we support nested attributes there will
be special cases when the projected and filtered attributes are parts of
the same top-level attribute, so the code will need to handle those cases
correctly. As I was working on issue #6951, now is a good time to write
a test for these special cases, even if nested attributes aren't yet
supported - so we don't forget to handle these special cases later.
Both new tests pass on DynamoDB, and xfail on Alternator.
Refs #5024 (nested attributes)
Refs #6951 (FilterExpression with ProjectionExpression)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A comment in test/alternator/test_lsi.py wrongly described the schema
of one of the test tables. Fix that comment.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch provides two more tests for issue #6951. As this issue was
already fixed, the two new tests pass.
The two new tests check two special cases which were handled correctly
but not yet tested - when the projected attribute is a key attribute of
the table or of one of its LSIs. Having these two additional tests will
ensure that any future refactoring or optimizations in this area of
the code (filtering, projection, and their combination) will not break
these special cases.
Refs #6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB API has for the Query and Scan requests two filtering
syntaxes - the old (QueryFilter or ScanFilter) and the new (FilterExpression).
Also for projection, it has an old syntax (AttributesToGet) and a new
one (ProjectionExpression). Combining an old-style and new-style parameter
is forbidden by DynamoDB, and should also be forbidden by Alternator.
This patch fixes, and removes the "xfail" tag from, two tests:
test_query_filter.py::test_query_filter_and_projection_expression
test_filter_expression.py::test_filter_expression_and_attributes_to_get
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.
The solution in this patch is to add to the generated JSON item the
extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.
The two tests
test_query_filter.py::test_query_filter_and_attributes_to_get
test_filter_expression.py::test_filter_expression_and_projection_expression
which failed before this patch now pass, so we drop their "xfail" tag.
Fixes#6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
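The filter-then-strip approach can be sketched generically (hypothetical names and a toy item type, not Alternator's real representation): keep the filter-only attributes on the item while the filter runs, then erase everything outside the projection before returning.

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>

using item = std::map<std::string, std::string>;

// Run the filter on the full item, then keep only projected attributes.
// Returns false if the item was filtered out entirely.
template <typename Filter>
bool filter_then_project(item& it, const std::set<std::string>& projected,
                         Filter matches) {
    if (!matches(it)) {
        return false;
    }
    for (auto i = it.begin(); i != it.end();) {
        // erase attributes that were only fetched for the filter's benefit
        i = projected.count(i->first) ? std::next(i) : it.erase(i);
    }
    return true;
}
```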
* seastar 292ba734bc...8c8fd3ed28 (15):
> semaphore_units: add return_units and return_all
> semaphore_units: release: mark as noexcept
> circular_buffer: support non-default-constructible allocators correctly
> core/shared_ptr: Expose use count through {lw_}enable_shared_from_this
> memory: align allocations to std::max_align_t
> util/log: logger::do_log(): go easier on allocations
> doc: add link to multipage version of tutorial
> doc: fix the output directories of split and tutorial.html
> build: do not copy htmlsplit.py to build dir
> doc: add "--input" and "--output-dir" options to htmlsplit.py
> doc: update split script to use xml.etree.ElementTree
> Merge "shared_future: make functions noexcept" from Benny
> tutorial: add linebreak between sections
> tutorial: format "future<int>" as inline code block
> docs: specify HTML language code for tutorial.html
This patch changes the way the request object is passed,
using a reference instead of temporaries.
The 'exists' test now passes in debug mode, whereas it was
always failing before.
Fixes#7261 by ensuring the request object is alive for all commands
during the whole request duration.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200924202034.30399-1-etienne.adam@gmail.com>
Lua 5.4 added an extra parameter to lua_resume()[1]. The parameter
denotes the number of arguments yielded, but our coroutines
don't yield any arguments, so we can just ignore it.
Define a macro to allow adding extra stuff with Lua 5.4,
and use it to supply the extra parameter.
[1] https://www.lua.org/manual/5.4/manual.html#8.3
Closes #7324
Clang eagerly instantiates templates, apparently with the following
algorithm:
- if both the declaration and definition are seen at the time of
instantiation, instantiate the template
- if only the declaration is seen at the time of instantiation, just emit
a reference to the template; even if the definition is later seen,
it is not instantiated
The "reference" in the second case is a relocation entry in the object file
that is satisfied at link time by the linker, but if no other object file
instantiated the needed template, a link error results.
These problems are hard to diagnose but easy to fix. This series fixes all
known such issues in the code base. It was tested on gcc as well.
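The fix pattern can be illustrated with a tiny made-up example (the names `twice` and `use_twice` are illustrative, not from the code base): with clang, a use that precedes the definition leaves only a relocation entry, so the definition must come first.

```cpp
// Declaration only.
template <typename T> T twice(T x);

// If use_twice() appeared here, before the definition, clang would emit an
// unresolved reference to twice<int>; the later definition would never be
// instantiated and linking could fail.

// Definition first...
template <typename T> T twice(T x) { return x + x; }

// ...then the use: clang sees both declaration and definition, so it
// instantiates twice<int> in this translation unit.
int use_twice(int v) { return twice(v); }
```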
Closes #7322
* github.com:scylladb/scylla:
query-result-reader: order idl implementations correctly
frozen_schema: order idl implementations correctly
idl-compiler: generate views after serializers
On Ubuntu 16/18 and Debian 9, LimitNOFILE is set to 4096 and cannot be
overridden from a user unit.
To run scylla-server in such environment, we need to turn on developer mode and show
warnings.
Fixes #7133
Closes #7323
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
Fix by ordering the idl implementation files so that definitions
come before uses.
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
Fix by ordering the idl implementation files so that definitions
come before uses.
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
In this case, the views use the serializer implementations, but are
generated before them.
Fix by generating the view implementations after the serializer
implementations that they use.
Add disable option for test configuration.
Tests in this list will be disabled for all modes.
* alejo/next-disable-raft-tests-01:
Raft: disable boost tests for now
Tests: add disable to configuration
Raft: Remove tests for now
For suite.yaml, add an extra configuration option, disable.
Tests in this list will be disabled for all modes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Remove raft C++ tests until raft is included in build process.
[tgrabiec]: Fixes test.py failure. Tests are not compiled unless --build-raft is
passed to configure.py and we cannot enable it by default yet.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20201002102847.1140775-1-alejo.sanchez@scylladb.com>
An unavailable exception means that the operation was not started and can be
retried safely. If LWT fails in the learn stage, though, it most
certainly means that its effect will already be observable. The patch
returns a timeout exception instead, which signals uncertainty.
Fixes #7258
Message-Id: <20201001130724.GA2283830@scylladb.com>
This is the beginning of raft protocol implementation. It only supports
log replication and voter state machine. The main difference between
this one and the RFC (besides having voter state machine) is that the
approach taken here is to implement raft as a deterministic state
machine and move all the IO processing away from the main logic.
To do that, some changes to the RPC interface were required: all verbs are now
one-way, meaning that sending a request does not wait for a reply;
the reply arrives as a separate message (or not at all; it is safe to
drop packets).
* scylla-dev/raft-v4:
raft: add a short readme file
raft: compile raft tests
raft: add raft tests
raft: Implement log replication and leader election
raft: Introduce raft interface header
Compilation is not enabled by default as it requires coroutines support
and may require a special compiler (until the distributed one fixes all the
bugs related to coroutines). To enable raft test compilation, a new
configure.py option is added (--build-raft).
Add tests for currently implemented raft features. replication_test
tests replication functionality with various initial log configurations.
raft_fsm_test tests the voting state machine functionality.
This patch introduces a partial RAFT implementation. It has only log
replication and leader election support. Snapshotting and configuration
change along with other, smaller features are not yet implemented.
The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a networking message, time tick or new append message), changes its
state and produces an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM are handled
by raft::server class. It uses raft::rpc interface to send and receive
messages and raft::storage interface to implement persistence.
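The deterministic-FSM idea can be sketched like this (all types here are illustrative toys, not raft::fsm's real interface): the state machine consumes an input and returns a description of the IO work, but never performs IO itself.

```cpp
#include <vector>
#include <string>
#include <cstdint>
#include <cstddef>

struct log_entry { uint64_t term; std::string data; };

// Everything with a side effect is *described* in the output and carried
// out by the surrounding server (raft::server in the real code).
struct fsm_output {
    std::vector<log_entry> to_persist;   // state that must be written to storage
    std::vector<std::string> to_send;    // messages the server should transmit
    std::vector<log_entry> to_apply;     // committed entries, in order
};

class toy_fsm {
    uint64_t _term = 0;
    std::vector<log_entry> _log;
public:
    // Input: a new command to append. Output: what the IO layer must do.
    // No networking or disk access happens here, keeping the FSM deterministic.
    fsm_output add_entry(std::string data) {
        log_entry e{_term, std::move(data)};
        _log.push_back(e);
        return fsm_output{{e}, {"append_request"}, {}};
    }
    size_t log_size() const { return _log.size(); }
};
```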
This commit introduces public raft interfaces. raft::server represents a
single raft server instance. raft::state_machine represents a user
defined state machine. raft::rpc, raft::rpc_client and raft::storage are
used to allow implementing custom networking and storage layers.
A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.
The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.
This fix can be reverted if updateable_value becomes safe to use across
shards.
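The factory pattern described above can be sketched as follows (the `_stub` types are illustrative, not Scylla's real `updateable_value` or config classes): the config stores a function, and each shard calls it locally so the returned value observes a source living on that same shard.

```cpp
#include <cstdint>
#include <functional>

// Toy stand-in for seastar's updateable_value.
template <typename T>
struct updateable_value_stub { T value; };

struct cql_server_config_stub {
    // Each shard invokes this on its own reactor instead of copying a
    // value bound to a shard-0 updateable_value_source.
    std::function<updateable_value_stub<uint32_t>()> get_max_concurrent_requests;
};

// Hypothetical per-shard factory: in real code this would read the
// shard-local copy of the configuration.
updateable_value_stub<uint32_t> make_for_this_shard(uint32_t limit) {
    return updateable_value_stub<uint32_t>{limit};
}
```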
Tests: unit(dev)
Fixes: #7310
The 'redis_database_count' option already existed, but
was not used when initializing the keyspaces. This
patch merely uses it. I think it's better that way; it
seems cleaner not to create 15 x 5 tables when we
use only one redis database.
Also change a test to test with a higher max number
of databases.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200930210256.4439-1-etienne.adam@gmail.com>
In order to improve observability, add a username field to the
system_traces.sessions table. The system table should be changed
while upgrading by running the fix_system_distributed_tables.py
script. Until the table is updated, the old behaviour is preserved.
Fixes #6737.
This miniseries enhances the code from #7279 by:
* adding metrics for shed requests, which will allow pinpointing the problem if the max concurrent requests threshold is too low
* making the error message more comprehensive by pointing at the variable used to set max concurrent requests threshold
Example of an enhanced error message:
```
ConnectionException('Failed to initialize new connection to 127.0.0.1: Error from server: code=1001 [Coordinator node overloaded] message="too many in-flight requests (configured via max_concurrent_requests_per_shard): 18"',)})
```
Closes #7299
* github.com:scylladb/scylla:
transport: make _requests_serving param uint32_t
transport: make overloaded error message more descriptive
transport: add requests_shed metrics
Call sort_tokens at the caller as all call sites from
within token_metadata_impl call remove_endpoint for multiple
endpoints so the tokens can be re-sorted only once, when done
removing all tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's not realistic for a shard to have over 4 billion concurrent
requests, so this value can be safely represented in 32 bits.
Also, since the current concurrency limit is represented in uint32_t,
it makes sense for these two to have matching types.
"
The last major untracked area of the reader pipeline is the reader
buffers. These scale with the number of readers as well as with the size
and shape of data, so their memory consumption is unpredictable and varies
wildly. For example, many small rows will trigger larger buffers
allocated within the `circular_buffer<mutation_fragment>`, while few
larger rows will consume a lot of external memory.
This series covers this area by tracking the memory consumption of both
the buffer and its content. This is achieved by passing a tracking
allocator to `circular_buffer<mutation_fragment>` so that each
allocation it makes is tracked. Additionally, we now track the memory
consumption of each and every mutation fragment through its whole
lifetime. Initially I contemplated just tracking the `_buffer_size` of
`flat_mutation_reader::impl`, but concluded that as our reader trees are
typically quite deep, this would result in a lot of unnecessary
`signal()`/`consume()` calls, that scales with the number of mutation
fragments and hence adds to the already considerable per mutation
fragment overhead. The solution chosen in this series is to instead
track the memory consumption of the individual mutation fragments, with
the observation that these are typically always moved and very rarely
copied, so the number of `signal()`/`consume()` calls will be minimal.
This additional tracking introduces an interesting dilemma however:
readers will now have significant memory on their account even before
being admitted. So it may happen that they can prevent their own
admission via this memory consumption. To prevent this, memory
consumption is only forwarded to the semaphore upon admission. This
might be solved when the semaphore is moved to the front -- before the
cache.
Another consequence of this additional, more complete tracking is that
evictable readers now consume memory even when the underlying reader is
evicted. So it may happen that even though no reader is currently
admitted, all memory is consumed from the semaphore. To prevent any such
deadlocks, the semaphore now admits a reader unconditionally if no
reader is admitted -- that is, if all count resources are available.
Refs: #4176
Tests: unit(dev, debug, release)
"
* 'track-reader-buffers/v2' of https://github.com/denesb/scylla: (37 commits)
test/manual/sstable_scan_footprint_test: run test body in statement sched group
test/manual/sstable_scan_footprint_test: move test main code into separate function
test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
test/manual/sstable_scan_footprint_test: make clustering row size configurable
test/manual/sstable_scan_footprint_test: document sstable related command line arguments
mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*()
test: simple_schema: add make_static_row()
reader_permit: reader_resources: add operator==
mutation_fragment: memory_usage(): remove unused schema parameter
mutation_fragment: track memory usage through the reader_permit
reader_permit: resource_units: add permit() and resources() accessors
mutation_fragment: add schema and permit
partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
mutation_fragment: remove as_mutable_end_of_partition()
mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
flat_mutation_reader: make _buffer a tracked buffer
mutation_reader: extract the two fill_buffer_result into a single one
...
Today, whenever we build scylla in a single mode, we still
build jmx, tools and python3 for all of dev, release and debug.
Let's make sure we build only in the relevant build mode.
Also add unified-tar to the ninja build.
Closes #7260
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is an example of a failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```
In such a case, Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```
The fix here is to leave the query empty if not found. Users can still
retrieve the query contents from existing info.
Fixes #5843
Closes #7293
The backlog (current and max) in VIEW_BACKLOG is not updated, but the
nodes keep gossiping VIEW_BACKLOG all the time. For example:
```
INFO 2020-03-06 17:13:46,761 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026590,718)
INFO 2020-03-06 17:13:46,821 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026531,742)
INFO 2020-03-06 17:13:47,765 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027590,721)
INFO 2020-03-06 17:13:47,825 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027531,745)
INFO 2020-03-06 17:13:48,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028590,726)
INFO 2020-03-06 17:13:48,833 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028531,750)
INFO 2020-03-06 17:13:49,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029590,729)
INFO 2020-03-06 17:13:49,832 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029531,753)
```
The downside of such updates:
- Introduces more gossip exchange traffic
- Updates system.peers all the time
The extra unnecessary gossip traffic is fine for a cluster in good
shape, but when some of the nodes or shards are loaded, such messages and
the handling of such messages can make the system even busier.
With this patch, VIEW_BACKLOG is updated only when the backlog is really
updated.
Btw, we can even make the update only when the change of the backlog is
greater than a threshold, e.g., 5%, which can reduce the traffic even
further.
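The change, including the suggested threshold variant, can be sketched like this (the `backlog_publisher` type and 5% figure are illustrative, not Scylla's actual code):

```cpp
#include <cstdint>
#include <cmath>

struct backlog_publisher {
    uint64_t last_published = 0;
    double threshold = 0.05; // publish only on a >=5% relative change

    bool should_publish(uint64_t current) {
        if (current == last_published) {
            return false;              // unchanged: skip the gossip update
        }
        double base = last_published == 0 ? 1.0 : double(last_published);
        if (std::abs(double(current) - double(last_published)) / base < threshold) {
            return false;              // change too small to be worth gossiping
        }
        last_published = current;
        return true;
    }
};
```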
Fixes #5970
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0, in this situation quorum should
also be 0.
This commit adds the missing corner case.
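The corrected calculation can be sketched in a few lines (the function name is illustrative): `(rf / 2) + 1` is right for `rf >= 1` but yields 1 for `rf == 0`, whereas a quorum over zero replicas must be 0.

```cpp
#include <cstddef>

// Quorum size for a given replication factor, with the RF = 0 corner case.
size_t quorum_for(size_t rf) {
    return rf == 0 ? 0 : rf / 2 + 1;
}
```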
Tests: Unit Tests (dev)
Fixes #6905
Closes #7296
Detect if the compiler supports -Wstack-usage= and only enable it
if supported. Because each mode uses a different threshold, have each
mode store the threshold value and merge it into the flags only after
we decided we support it.
The switch to make it only a warning is made conditional on compiler
support.
On some environments, systemctl enable <service> fails when we use a symlink.
So just copy systemd units directly to ~/.config/systemd/user, instead of
creating a symlink.
Fixes #7288
Closes #7290
After cleaning up old cluster features (253a7640e3)
the code for special handling of 1.7.4 counter order was effectively
only used in its own tests, so it can be safely removed.
Closes #7289
This series approaches issue #7072 and provides a very simple mechanism for limiting the number of concurrent CQL requests being served on a shard. Once the limit is hit, new requests will be instantly refused and OverloadedException will be returned to the client.
This mechanism has many improvement opportunities:
* shedding requests gradually instead of having one hard limit,
* having more than one limit per different types of queries (reads, writes, schema changes, ...),
* not using a preconfigured value at all, and instead figuring out the limit dynamically,
* etc.
... and none of these are taken into account in this series, which only adds a very basic configuration variable. The variable can be updated live without a restart - it can be done by updating the .yaml file and triggering a configuration re-read via sending the SIGHUP signal to Scylla.
The default value for this parameter is a very large number, which translates to effectively not shedding any requests at all.
Refs #7072
Closes #7279
* github.com:scylladb/scylla:
transport: make max_concurrent_requests_per_shard reloadable
transport: return exceptional future instead of throwing
transport,config: add a param for max request concurrency
exceptions: make a single-param constructor explicit
exceptions: add a constructor based on custom message
This reverts commit 8366d2231d because it
causes the following "scylla_setup" failure on Ubuntu 16.04:
Command: 'sudo /usr/lib/scylla/scylla_setup --nic ens5 --disks /dev/nvme0n1 --swap-directory / '
Exit code: 1
Stdout:
Setting up libtomcrypt0:amd64 (1.17-7ubuntu0.1) ...
Setting up chrony (2.1.1-1ubuntu0.1) ...
Creating '_chrony' system user/group for the chronyd daemon…
Creating config file /etc/chrony/chrony.conf with new version
Processing triggers for libc-bin (2.23-0ubuntu11.2) ...
Processing triggers for ureadahead (0.100.0-19.1) ...
Processing triggers for systemd (229-4ubuntu21.29) ...
501 Not authorised
NTP setup failed.
Stderr:
chrony.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install enable chrony
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_ntp_setup", line 63, in <module>
run('chronyc makestep')
File "/opt/scylladb/scripts/scylla_util.py", line 504, in run
return subprocess.run(cmd, stdout=stdout, stderr=stderr, shell=shell, check=exception, env=scylla_env).returncode
File "/opt/scylladb/python3/lib64/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['chronyc', 'makestep']' returned non-zero exit status 1.
This configuration entry is expected to be used as a quick fix
for an overloaded node, so it should be possible to reload this value
without having to restart the server.
The newly introduced parameter - max_concurrent_requests_per_shard
- can be used to limit the number of in-flight requests a single
coordinator shard can handle. Each surplus request will be
immediately refused by returning OverloadedException error to the client.
The default value for this parameter is large enough to never
actually shed any requests.
Currently, the limit is only applied to CQL requests - other frontends
like alternator and redis are not throttled yet.
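The admission check described above can be sketched as follows (a toy, illustrative type; the real code returns an exceptional future carrying OverloadedException rather than a boolean):

```cpp
#include <cstdint>

struct toy_shard {
    uint32_t requests_serving = 0;
    uint32_t max_concurrent_requests = 0;
    uint64_t requests_shed = 0;   // metric counting refused requests

    // Returns false when the request must be refused with OverloadedException.
    bool try_admit() {
        if (requests_serving >= max_concurrent_requests) {
            ++requests_shed;      // surplus request is refused immediately
            return false;
        }
        ++requests_serving;
        return true;
    }
    void finish_request() { --requests_serving; }
};
```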
The memory usage is now maintained and updated on each change to the
mutation fragment, so it need not be recalculated on a call to
`memory_usage()`, hence the schema parameter is unused and can be
removed.
The memory usage of mutation fragments is now tracked throughout their
lifetime via a reader permit. This was the last major (to my current
knowledge) untracked piece of the reader pipeline.
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and passed to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_partition_start() &&`.
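The lambda-based mutation pattern used by the `mutate_as_*` methods can be sketched with a toy type (illustrative, not `mutation_fragment`'s real interface): callers never get a mutable reference, so the memory accounting can never be skipped.

```cpp
#include <functional>
#include <string>
#include <cstddef>

class toy_fragment {
    std::string _data;
    size_t _memory = 0;
    // Recompute the cached footprint after every mutation.
    void recalculate() { _memory = _data.size(); }
public:
    explicit toy_fragment(std::string d) : _data(std::move(d)) { recalculate(); }

    // Callers mutate through a lambda; accounting runs unconditionally after.
    void mutate(const std::function<void(std::string&)>& fn) {
        fn(_data);
        recalculate();
    }
    size_t memory_usage() const { return _memory; }
};
```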
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_range_tombstone() &&`.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_clustering_row() &&`.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_static_row() &&`.
Via a tracked_allocator. Although the memory allocations made by the
_buffer shouldn't dominate the memory consumption of the read itself,
they can still be a significant portion that scales with the number of
readers in the read.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track the memory
consumption of `_buffer`.
OverloadedException was historically only used when the number
of in-flight hints got too high. The other constructor will be useful
for using OverloadedException in other scenarios.
This can be used with standard containers and other containers that use
the std::allocator interface to track the allocations made by them via a
reader_permit.
In the next patches we plan to start tracking the memory consumption of
the actual allocations made by the circular_buffer<mutation_fragment>,
as well as the memory consumed by the mutation fragments.
This means that readers will start consuming memory off the permit right
after being constructed. Ironically this can prevent the reader from
being admitted, due to its own pre-admission memory consumption. To
prevent this, hold off on forwarding the memory consumption to the
semaphore until the permit is actually admitted.
Track all resources consumed through the permit inside the permit. This
allows querying how much memory each read is consuming (as there should
be one read per permit). Although this might be interesting, especially
when debugging OOM cores, the real reason we are doing this is to be
able to forward resource consumption to the semaphore only post-admission.
More on this in the patch introducing this.
Another advantage of tracking resources consumed through the permit is
that now we can detect resource leaks in the permit destructor and
report them. Even if it is just a case of the holder of the resources
wanting to release them later, with the permit already destroyed this
would cause a use-after-free.
In the next patches the reader permit will gain members that are shared
across all instances of the same permit. To facilitate this move all
internals into an impl class, of which the permit stores a shared
pointer. We use a shared_ptr to avoid defining `impl` in the header.
This is how the reader permit started in the beginning. We've done a
full circle. :)
And do all consuming and signalling through these methods. These
operations will soon be more involved than the simple forwarding they do
today, so we want to centralize them to a single method pair.
In the next patches we want to introduce per-permit resource tracking --
that is, have each permit track the amount of resource consumed through
it. For this, we need all consumption to happen through a permit, and
not directly with the semaphore.
To ensure progress at all times. This is due to evictable readers, which
still hold on to a buffer even when their underlying reader is evicted.
As we are introducing buffer and mutation fragment tracking in the next
patches, these readers will hold on to memory even in this state, so it
may theoretically happen that even though no readers are admitted (all
count resources are available), no reader can be admitted due to lack of
memory. To prevent such deadlocks we now always admit one reader if all
count resources are available.
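The deadlock-avoidance rule can be sketched like this (a toy, illustrative type, not the real reader concurrency semaphore): when all count units are free, no reader is admitted, so one is let in unconditionally even if memory is exhausted by evicted readers' buffers.

```cpp
struct resources { int count; long memory; };

struct toy_semaphore {
    resources total;
    resources available;

    bool can_admit(const resources& req) const {
        // All count units free => no reader is admitted => admit
        // unconditionally, otherwise memory held by evicted readers'
        // buffers could deadlock admission forever.
        if (available.count == total.count) {
            return true;
        }
        return req.count <= available.count && req.memory <= available.memory;
    }
};
```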
Current code uses a single counter to produce multiple buffers' worth of
data. This relies on carry-over from one buffer to the other, which happens
to work with the current memory accounting but is very fragile. Account for
each buffer separately, resetting the counter between them.
The test consumes all resources off the semaphore, leaving just enough
to admit a single reader. However this amount is calculated based on the
base cost of readers, but as we are going to track reader buffers as
well, the amount of memory consumed will be much less predictable.
So to make sure background readers can finish during shutdown, release
all the consumed resources before leaving scope.
No point in continuing processing the entire buffer once a failure was
found. Especially that an early failure might introduce conditions that
are not handled in the normal flow-path. We could handle these but there
is no point in this added complexity, at this point the test is failed
anyway.
Some tests rely on `consume*()` calls on the permit to take effect
immediately. Soon this will only be true once the permit has been
admitted, so make sure the permit is admitted in these tests.
Currently per-shard reader contexts are cleaned up as soon as the reader
itself is destroyed. This causes two problems:
* Continuations attached to the reader destroy future might rely on
stuff in the context being kept alive -- like the semaphore.
* Shard 0's semaphore is special as it will be used to account buffers
allocated by the multishard reader itself, so it has to be alive until
after all readers are destroyed.
This patch changes this so that contexts are destroyed only when the
lifecycle policy itself is destroyed.
* seastar e215023c7...292ba734b (4):
> future: Fix move of futures of reference type
> doc: fix hyper link to tutorial.html
> tutorial: fix formatting of code block
> README.md: fix the formatting of table
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader.
The intent is to prevent corrupt data from getting into the system as
well as to help catch these bugs as close to the source as possible.
Fixes: #7208
Tests: unit(dev), mutation_reader_test:debug (v4)
* botond/evictable-reader-validate-buffer/v5:
mutation_reader_test: add unit test for evictable reader self-validation
evictable_reader: validate buffer after recreation the underlying
evictable_reader: update_next_position(): only use peek'd position on partition boundary
mutation_reader_test: add unit test for evictable reader range tombstone trimming
evictable_reader: trim range tombstones to the read clustering range
position_in_partition_view: add position_in_partition_view before_key() overload
flat_mutation_reader: add buffer() accessor
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader. Several things are checked:
* The partition is the expected one -- the one we were in the middle of
or the next if we stopped at partition boundaries.
* The partition is in the read range.
* The first fragment in the partition is the expected one -- has an
equal or larger position than the next expected fragment.
* The fragment is in the clustering range as defined by the slice.
As these validations are only done on the slow-path of recreating an
evicted reader, no performance impact is expected.
`evictable_reader::update_next_position()` is used to record the position the
reader will continue from, in the next buffer fill. This position is used to
create the partition slice when the underlying reader is evicted and has
to be recreated. There is an optimization in this method -- if the
underlying's buffer is not empty we peek at the first fragment in it and
use it as the next position. This is however problematic for buffer
validation on reader recreation (introduced in the next patch), because
using the next row's position as the next pos will allow for range
tombstones to be emitted with before_key(next_pos.key()), which will
trigger the validation. Instead of working around this, just drop this
optimization for mid-partition positions, it is inconsequential anyway.
We keep it for where it is important, when we detect that we are at a
partition boundary. In this case we can avoid reading the current
partition altogether when recreating the reader.
Currently mutation sources are allowed to emit range tombstones that are
outside the clustering read range if they are relevant to it. For example
a read of a clustering range [ck100, +inf), might start with:
range_tombstone{start={ck1, -1}, end={ck200, 1}},
clustering_row{ck100}
The range tombstone is relevant to the range and the first row of the
range so it is emitted first, but its position (start) is outside the
read range. This is normally fine, but it poses a problem for evictable
reader. When the underlying reader is evicted and has to be recreated
from a certain clustering position, this results in out-of-order
mutation fragments being inserted into the middle of the stream. This is
not fine anymore as the monotonicity guarantee of the stream is
violated. The real solution would be to require all mutation sources to
trim range tombstones to their read range, but this is a lot of work.
Until that is done, as a workaround we do this trimming in the evictable
reader itself.
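The trimming workaround can be sketched with plain integers standing in for clustering positions (all types here are illustrative toys): clamp the tombstone's range to the read range, dropping it if nothing remains.

```cpp
#include <algorithm>
#include <optional>

struct toy_tombstone { int start; int end; };  // half-open [start, end)
struct toy_range { int start; int end; };

// Trim a tombstone to the read range, as the evictable reader does so the
// recreated stream never emits a position before the read range.
std::optional<toy_tombstone> trim(toy_tombstone t, toy_range r) {
    t.start = std::max(t.start, r.start);
    t.end = std::min(t.end, r.end);
    if (t.start >= t.end) {
        return std::nullopt;  // tombstone entirely outside the read range
    }
    return t;
}
```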
The "mode" variable name is used everywhere, usually in a loop.
Therefore, rename the global "mode" to "checkheaders_mode" so that if
your code block happens to be outside of a loop, you don't accidentally
use the globally visible "mode" and spend hours debugging why it's
always "dev".
Spotted by Yaron Kaikov.
Message-Id: <20200924112237.315817-1-penberg@scylladb.com>
The script pull_github_pr.sh uses git merge's "--log" option to put in
the merge commit the list of titles of the individual patches being
merged in. This list is useful when later searching the log for the merge
which introduced a specific feature.
Unfortunately, "--log" defaults to cutting off the list of commit titles
at 20 lines. For most merges involving fewer than 20 commits, this makes
no difference. But some merges include more than 20 commits, and get
a truncated list, for no good reason. If someone worked hard to create a
patch set with 40 patches, the last thing we should be worried about is
that the merge commit message will be 20 lines longer.
Unfortunately, there appears to be no way to tell "--log" to not limit
the length at all. So I chose an arbitrary limit of 1000. I don't think
we ever had a patch set in Scylla which exceeded that limit. Yet :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200924114403.817893-1-nyh@scylladb.com>
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.
The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decommissioned or removed, it instead drains the hints managers for all
endpoints.
In the case of decommission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.
End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.
To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
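The race and its fix can be modeled in a few lines (a sketch with hypothetical names; the real code is asynchronous and per-endpoint managers hold real hint files):

```python
# Once a full drain starts, a flag prevents store_hint() from lazily
# creating new per-endpoint managers that would later be cleared without
# being drained (which would trip the destructor's assert).

class HintsManager:
    def __init__(self):
        self._ep_managers = {}
        self._draining_all = False

    def store_hint(self, endpoint, hint):
        if self._draining_all:
            return False  # refuse new hints during a full drain
        self._ep_managers.setdefault(endpoint, []).append(hint)
        return True

    def drain_all(self):
        self._draining_all = True   # set before draining, closing the race
        for mgr in self._ep_managers.values():
            mgr.clear()             # stands in for sending out each hint
        self._ep_managers.clear()

m = HintsManager()
assert m.store_hint("node1", "h1")
m.drain_all()
assert not m.store_hint("node2", "h2")  # no undrained manager can appear
assert m._ep_managers == {}
```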
Fixes #7257
Closes #7278
Currently, sstable_manager is used to create sstables, but it loses track
of them immediately afterwards. This series makes an sstable's life fully
contained within its sstable_manager.
The first practical impact (implemented in this series) is that file removal
stops being a background job; instead it is tracked by the sstable_manager,
so when the sstable_manager is stopped, you know that all of its sstable
activity is complete.
Later, we can make use of this to track the data size on disk, but this is not
implemented here.
Closes#7253
* github.com:scylladb/scylla:
sstables: remove background_jobs(), await_background_jobs()
sstables: make sstables_manager take charge of closing sstables
test: test_env: hold sstables_manager with a unique_ptr
test: drop test_sstable_manager
test: sstables::test_env: take ownership of manager
test: broken_sstable_test: prepare for asynchronously closed sstables_manager
test: sstable_utils: close test_env after use
test: sstable_test: dont leak shared_sstable outside its test_env's lifetime
test: sstables::test_env: close self in do_with helpers
test: perf/perf_sstable.hh: prepare for asynchronously closed sstables_manager
test: view_build_test: prepare for asynchronously closed sstables_manager
test: sstable_resharding_test: prepare for asynchronously closed sstables_manager
test: sstable_mutation_test: prepare for asynchronously closed sstables_manager
test: sstable_directory_test: prepare for asynchronously closed sstables_manager
test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
test: sstable_conforms_to_mutation_source_test: remove references to test_sstables_manager
test: sstable_3_x_test: remove test_sstables_manager references
test: schema_changes_test: drop use of test_sstables_manager
mutation_test: adjust for column_family_test_config accepting an sstables_manager
test: lib: sstable_utils: stop using test_sstables_manager
test: sstables test_env: introduce manager() accessor
test: sstables test_env: introduce do_with_async_sharded()
test: sstables test_env: introduce do_with_async_returning()
test: lib: sstable test_env: prepare for life as a sharded<> service
test: schema_changes_test: properly close sstables::test_env
test: sstable_mutation_test: avoid constructing temporary sstables::test_env
test: mutation_reader_test: avoid constructing temporary sstables::test_env
test: sstable_3_x_test: avoid constructing temporary sstables::test_env
test: lib: test_services: pass sstables_manager to column_family_test_config
test: lib: sstables test_env: implement tests_env::manager()
test: sstable_test: detemplate write_and_validate_sst()
test: sstable_test_env: detemplate do_with_async()
test: sstable_datafile_test: drop bad 'return'
table: clear sstable set when stopping
table: prevent table::stop() race with table::query()
database: close sstable_manager:s
sstables_manager: introduce a stub close()
sstable_directory_test: fix threading confusion in make_sstable_directory_for*() functions
test: sstable_datafile_test: reorder table stop in compaction_manager_test
test: view_build_test: test_view_update_generator_register_semaphore_unit_leak: do not discard future in timer
test: view_build_test: fix threading in test_view_update_generator_register_semaphore_unit_leak
view: view_update_generator: drop references to sstables when stopping
The echo message does not need to access gossip internal state, so we can
run it on all shards and avoid forwarding to shard zero.
This makes gossip marking a node up more robust when shard zero is loaded.
There is an argument that we should make the echo message return only when
all shards have responded, so that all shards are known to be live and
responding. However, in a heavily loaded cluster, one shard might be
overloaded on multiple nodes in the cluster at the same time. If we require
an echo response from all shards, there is a chance the local node will mark
all peer nodes as down. As a result, the whole cluster is down. This is much
worse than not excluding a node with a slow shard from the cluster.
Refs: #7197
Gossip echo message is used to confirm a node is up. In a heavily loaded,
slow cluster, a node might take a long time to receive a heartbeat
update, in which case the node uses the echo message to confirm the peer
node is really up.
If the echo message times out too early, the peer node will not be marked
as up. This is bad because a live node is marked as down, and this could
happen on multiple nodes in the cluster, causing a cluster-wide
unavailability issue. In order to prevent multiple nodes from being marked
as down, it is better to be conservative and less restrictive with the echo
message timeout.
Note that the echo message is not used to detect a node going down.
Increasing the echo timeout does not have any impact on marking a node down
in a timely manner.
Refs: #7197
Currently, closing sstables happens from the sstable destructor.
This is problematic since a destructor cannot wait for I/O, so
we launch the file close process in the background. We therefore
lose track of when the closing actually takes place.
This patch makes sstables_manager take charge of the close process.
Every sstable is linked into one of two intrusive lists in its
manager: _active or _undergoing_close. When the reference count
of the sstable drops to zero, we move it from _active to
_undergoing_close and begin closing the files. sstables_manager
remembers all closes and when sstables_manager::close() is called,
it waits for all of them to complete. Therefore,
sstables_manager::close() allows us to know that all files it
manages are closed (and deleted if necessary).
The sstables_manager also gains a destructor, which disables
move construction.
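The bookkeeping described above can be sketched as follows (hypothetical names, synchronous for brevity; the real manager uses intrusive lists and waits on futures for asynchronous file closes):

```python
# The manager keeps every sstable in one of two collections; when an
# sstable's reference count drops to zero it moves from _active to
# _undergoing_close, and close() returns only once that set has drained.

class SstablesManager:
    def __init__(self):
        self._active = set()
        self._undergoing_close = set()

    def add(self, sst):
        self._active.add(sst)

    def on_refcount_zero(self, sst):
        self._active.discard(sst)
        self._undergoing_close.add(sst)
        self._finish_close(sst)  # the real code closes files asynchronously

    def _finish_close(self, sst):
        self._undergoing_close.discard(sst)

    def close(self):
        # With synchronous closes the set is already empty; the real
        # manager instead waits for all outstanding close futures.
        assert not self._undergoing_close

mgr = SstablesManager()
mgr.add("sst-1")
mgr.on_refcount_zero("sst-1")
mgr.close()
assert not mgr._active
```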
sstables_manager should not be movable (since sstables hold a reference
to it). A following patch will enforce it.
Prepare by using unique_ptr to hold test_env::_manager. Right now, we'll
invoke sstables_manager move construction when creating a test_env with
do_with().
We could have chosen to update sstables when their sstables_manager is
moved, but we get nothing for the complexity.
With no users left, apart from some variants of column_family_test_config
(which are removed in this patch), remove it.
test_sstables_manager obstructs sstables_manager from taking charge
of sstable ownership, since it is a thread-local object. We can't close it,
since it will be used by the next test to run.
Instead of using test_sstables_manager, which we plan to drop,
carry our own sstables_manager in test_env, and close it when
test_env::stop() is called.
do_write_sst() creates a test_env, creates a shared_sstable using that
test_env, destroys the test_env, and returns the sstable. This works now
but will stop working once sstables_manager becomes responsible for
sstable lifetime.
Fortunately, do_write_sst() has one caller that isn't interested in the
return value at all, so fold it into that caller.
The test_env::do_with() are convenient for creating a scope
containing a test_env. Prepare them for asynchronously closed
sstables_manager by closing the test_env after use (which will,
in the future, close the embedded sstables_manager).
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes
- replace test_sstables_manager with an sstables_manager obtained from test_env
- drop unneeded calls to await_background_jobs()
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes
- acquire a test_env with test_env::do_with() (or the sharded variant)
- change the sstable_from_existing_file function to be a functor that
works with either cql_test_env or test_env (as this is what individual
tests want); drop use of test_sstables_manager
- change new_sstable() to accept a test_env instead of using test_sstables_manager
- replace test_sstables_manager with an sstables_manager obtained from test_env
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes
- replace on-stack test_env with test_env::do_with()
- use the variant of column_family_for_tests that accepts an sstables_manager
- replace test_sstables_manager with an sstables_manager obtained from test_env
These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
Since test_env now calls await_background_jobs on termination, those
calls are dropped.
Use the sstables_manager from test_env. Use do_with_async() to create the test_env,
to allow for proper closing.
Since do_with_async() also takes care of await_background_jobs(), remove that too.
test_sstables_manager is going away, so replace it by test_env::manager().
column_family_test_config() has an implicit reference to test_sstables_manager,
so pass test_env::manager() as a parameter.
Calls to await_background_jobs() are removed, since test_env::stop() performs
the same task.
The large rows tests are special, since they use a custom sstables_manager,
so instead of using a test_env, they just close their local sstables_manager.
Acquire a test_env and extract an sstables_manager from that, passing it
to column_family_test_config, in preparation for losing the default
constructor of column_family_test_config.
Some tests need a sharded sstables_manager, prepare for that by
adding a stop() method and helpers for creating a sharded service.
Since test_env doesn't yet contain its own sstable_manager, this
can't be used in real life yet.
sstables::test_env needs to be properly closed (and will soon need it
even more). Use test_env::do_with_async() to do that, and remove
await_background_jobs(), which is now done by test_env::close().
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
As a bonus, test_env::do_with_async() performs await_background_jobs() for us, so
we can drop it from the call sites.
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
As a bonus, test_env::do_with_async() performs await_background_jobs() for us, so
we can drop it from the call sites.
Since we're dropping test_sstables_manager, we'll require callers to pass it
to column_family_test_config, so provide overloads that accept it.
The original overloads (that don't accept an sstables_manager) remain for
the transition period.
Some tests are now referencing the global test_sstables_manager,
which we plan to remove. Add test_env::manager() as a way to
reference the sstables_manager that the test_env contains.
The pattern
return function_returning_a_future().get();
is legal, but confusing. It returns an unexpected std::tuple<>. Here,
it doesn't do any harm, but if we try to coerce the surrounding code
into a signature (void ()), then that will fail.
Remove the unneeded and unexpected return.
Drop references to a table's sstables when stopping it, so that
the sstables_manager can start deleting them. This includes staging
sstables.
Although the table is no longer in use at this time, we keep the cache
in sync by calling row_cache::invalidate() (this also
has the benefit of avoiding a stall in row_cache's destructor).
We also refresh the cache's view of the sstable set to drop
the cache's references.
Take the gate in table::query() so that stop() waits for queries. The gate
is already waited for in table::stop().
This allows us to know we are no longer using the table's sstables in table::stop().
sstables_manager is going to take charge of its sstables lifetimes,
so it will need a close() to wait until sstables are deleted.
This patch adds sstables_manager::close() so that the surrounding
infrastructure can be wired to call it. Once that's done, we can
make it do the waiting.
The make_sstable_directory_for*() functions run in a thread, and
call functions that run in a thread, but return a future. This
more or less works but is a dangerous construct that can fail.
Fix by returning a regular value.
Stopping a table will soon close its sstables; so the next check will fail
as the number of sstables for the table will be zero.
Reorder the stop() call to make it safe.
We don't need the stop() for the check, since the previous loop made sure
compactions completed.
test_view_update_generator_register_semaphore_unit_leak creates a continuation chain
inside a timer, but does not wait for it. This can result in part of the chain
being executed after its captures have been destroyed.
This is unlikely to happen since the timer fires only if the test fails, and
tests never fail (at least in the way that one expects).
Fix by waiting for that future to complete before exiting the thread.
test_view_update_generator_register_semaphore_unit_leak uses a thread function
in do_with_cql_env(), even though the latter doesn't promise a thread and
accepts a regular function-returning-a-future. It happens to work because the
function happens to be called in a thread, but this isn't guaranteed.
Switch to do_with_cql_env_thread, which guarantees a thread context.
sstable_manager will soon wait for all sstables under its
control to be deleted (if so marked), but that can't happen
if someone is holding on to references to those sstables.
To allow sstables_manager::stop() to work, drop remaining
queued work when terminating.
The script scripts/pull_github_pr.sh begins by fetching some information
from github, which can cause a noticeable wait that the user doesn't
understand - so in this patch we add a couple of messages about what is
happening at the beginning of the script.
Moreover, if an invalid pull-request number is given, the script used
to give mysterious errors when incorrect commands were run using the name
"null" - in this patch we recognize this case and print a clear "Not Found"
error message.
Finally, the PR_REPO variable was never used, so this patch removes it.
Message-Id: <20200923151905.674565-1-nyh@scylladb.com>
Currently, scripts/pull_pr pollutes the local branch namespace
by creating a branch and never deleting it. This can be avoided
by using FETCH_HEAD, a temporary name automatically assigned by
git to fetches with no destination.
"
Currently all logical read operations are counted in a single pair of
metrics (successful/failed) located in `database::db_stats`. This
prevents observing the number of reads executed against the user/system
read semaphores. This distinction is especially interesting since
0c6bbc84c, which selects the semaphore for each read based on the
scheduling group it is running under.
This mini series moves these counters into the semaphore and updates the
exported metrics accordingly; `total_reads` and `total_reads_failed`
now have a user/system label, just like the other semaphore-dependent
metrics.
Tests: manual (checked that new metric works)
"
* 'per-semaphore-read-metrics/v2' of https://github.com/denesb/scylla:
database: move total_reads* metrics to the concurrency semaphore
database: setup_metrics(): split the registering database metrics in two
reader_concurrency_semaphore: add non-const stats accessor
reader_concurrency_semaphore: s/inactive_read_stats/stats/
Currently all "database" metrics are registered in a single call to
`metric_groups::add_group()`. As all the metrics to-be-registered are
passed in a single initializer list, this blows up the stack size, to
the point that adding a single new metric causes it to exceed the
currently configured max-stack-size of 13696 bytes. To reduce stack
usage, split the single call in two, roughly in the middle. While we
could try to come up with some logical grouping of metrics and do much
arranging and code-movement I think we might as well just split into two
arbitrary groups, containing roughly the same amount of metrics.
Commit 8e1f7d4fc7 ("dist/common/scripts: drop makedirs(), use
os.makedirs()") dropped the "makedirs()" function, but forgot to convert
the caller in scylla_kernel_check to os.makedirs().
Message-Id: <20200923104510.230244-1-penberg@scylladb.com>
In preparations of non-inactive read stats being added to the semaphore,
rename its existing stats struct and member to a more generic name.
Fields, whose name only made sense in the context of the old name are
adjusted accordingly.
As the test test_streams_closed_read confirmed, when a stream shard is
closed, GetRecords should not return a NextShardIterator at all.
Before this patch we wrongly returned an empty string for it.
Before this patch, several Alternator Stream tests (in test_streams.py)
failed when running against a multi-node Scylla cluster. The reason is as
follows: As a multi-node cluster boots and more and more nodes enter the
cluster, the cluster changes its mind about the token ownership, and
therefore the list of stream shards changes. By the time we have the full
cluster, a bunch of shards were created and closed without any data yet.
All the tests will see these closed shards, and need to understand them.
The fetch_more() utility function correctly assumed that a closed shard
does not return a NextShardIterator, and got confused by the empty string
we used to return.
Now that closed shards can return responses without NextShardIterator,
we also needed to fix in this patch a couple of tests which wrongly assumed
this can't happen. These tests did not fail on DynamoDB because unlike in
Scylla, DynamoDB does not have any closed shards in normal tests which
do not specifically cause them (only test_streams_closed_read).
We also need to fix test_streams_closed_read to get rid of an unnecessary
assumption: it currently assumes that when the very last item in
a closed shard is read, the end-of-shard is immediately signaled (i.e.,
NextShardIterator is not returned). Although DynamoDB does in fact do this,
it is also perfectly legal for Alternator's implementation to return the
last item with a new NextShardIterator - and only when the client reads
from that iterator do we finally signal the end of the shard.
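The expected client-side behavior can be sketched with a fake GetRecords (the field names follow the DynamoDB Streams API; the loop is a simplified stand-in for the fetch_more() utility mentioned above):

```python
# Reading a shard to completion: a closed shard's final page omits
# NextShardIterator entirely. An empty string there (the old bug) would
# wrongly look like a live iterator and confuse the loop.

def fetch_all(get_records, iterator):
    records = []
    while iterator is not None:
        resp = get_records(iterator)
        records += resp.get("Records", [])
        iterator = resp.get("NextShardIterator")  # absent => shard closed
    return records

pages = {
    "it0": {"Records": [1, 2], "NextShardIterator": "it1"},
    "it1": {"Records": [3]},  # end of closed shard: no NextShardIterator
}
assert fetch_all(lambda it: pages[it], "it0") == [1, 2, 3]
```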
Fixes #7237.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200922082529.511199-1-nyh@scylladb.com>
While many of these warnings are important, they can be addressed
at a lower priority. In any case it will be easier to enforce
the warnings if/when we switch to clang.
The classes touch each other's private data for no real
reason. Putting the interaction behind an API makes it easier
to track the usage.
* xemul/br-unfriends-in-row-cache-2:
row cache: Unfriend classes from each other
rows_entry: Move container/hooks types declarations
rows_entry: Simplify LRU unlink
mutation_partition: Define .replace_with method for rows_entry
mutation_partition: Use rows_entry::apply_monotonically
Merged pull request https://github.com/scylladb/scylla/pull/7264 by
Avi Kivity (29 commits):
This series fixes issues found by clang. Most are real issues that gcc just
doesn't find, a few are due to clang lagging behind on some C++ updates.
See individual explanations in patches.
The series is not sufficient to build with clang; it just addresses the
simple problems. Two larger problems remain: clang isn't able to compile
std::ranges (not clear yet whether this is a libstdc++ problem or a clang
problem) and clang can't capture structured binding variables (due to
lagging behind on the standard).
The motivation for building with clang is gaining access to a working
implementation of coroutines and modules.
This series compiles with gcc and the unit tests pass.
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` object to protect them (release them
on destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Albeit the
`reader_resources` constructor is `noexcept` right now, this might change
in the future, and as the call sites relying on this are disconnected
from the declaration, whoever modifies them might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.
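The RAII pattern described above, modeled with a Python context manager for illustration (hypothetical names; the real class is C++ and ties release to the destructor):

```python
# Resources are consumed on construction/__enter__ and released on
# destruction/__exit__, so there is no window between deducting the units
# and arming their release in which an exception could leak them.

class ResourceUnits:
    def __init__(self, semaphore, count):
        self._sem, self._count = semaphore, count

    def __enter__(self):
        self._sem["available"] -= self._count  # consume on acquisition
        return self

    def __exit__(self, *exc):
        self._sem["available"] += self._count  # always released
        return False

sem = {"available": 10}
with ResourceUnits(sem, 3):
    assert sem["available"] == 7
assert sem["available"] == 10  # released even if the body had thrown
```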
Fixes: #7256
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
We have templates for multiprecision_int for both sides of the operator,
for example:
template <typename T>
bool operator==(const T& x) const
and
template <typename T>
friend bool operator==(const T& x, const multiprecision_int& y)
Clang considers them equally satisfying when both operands are
multiprecision_int, so provide a disambiguating overload.
Clang dislikes forward-declared functions returning auto, so declare the
type up front. Functions returning auto are a readability problem
anyway.
To solve a circular dependency problem (get_local_injector() ->
error_injection<> -> get_local_injector()), which is further compounded
by problems in using template specializations before they are defined
(which is forbidden), the storage for get_local_injector() was moved
to error_injection<>, and get_local_injector() is just an accessor.
After this, error_injection<> does not depend on get_local_injector().
new_sstable is defined as a template, and later used in a context
that requires an object. Somehow gcc uses an instantiation with
an empty template parameter list, but I don't think it's right,
and clang refuses.
Since the template is gratuitous anyway, just make it a regular
function.
This patch adds a test, test_streams_closed_read, which reproduces
two issues in Alternator Streams, regarding the behavior of *closed*
stream shards:
Refs #7239: After streaming is disabled, the stream should still be readable,
it's just that all its shards are now "closed".
Refs #7237: When reaching the end of a closed shard, NextShardIterator should
be missing. Not set to an empty string as we do today.
The test passes on DynamoDB, and xfails on Alternator, and should continue to
do so until both issues are fixed.
This patch changes the implementation of the disable_stream() function.
This function was never actually used by the existing code, and now that
I wanted to use it, I discovered it didn't work as expected and had to fix it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200915134643.236273-1-nyh@scylladb.com>
The gossiper verbs are registered in two places -- start_gossiping
and do_shadow_round(). And unregistered in one -- stop_gossiping
iff the start took place. Consequently, there's a chance that after
a shadow round scylla exits without starting gossiping, thus leaving the
verbs armed.
Fix by unregistering verbs on stop if they are still registered.
fixes: #7262
tests: manual(start, abort start after shadow round), unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921140357.24495-1-xemul@scylladb.com>
Before updating the _last_[cp]key (for subsequent .fetch_page())
the pager checks 'if the pager is not exhausted OR the result
has data'.
The check seems broken: if the pager is not exhausted but the
result is empty, the call for keys will unconditionally try to
reference the last element of an empty vector. The not-exhausted
condition with an empty result can happen if short_read is set,
which, in turn, unconditionally happens upon meeting a partition
end when visiting the partition with the result builder.
The correct check should be 'if the pager is not exhausted AND
the result has data': the _last_[pc]key-s should be taken for
continuation (not exhausted), but can only be taken if the result
is not empty (has data).
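The corrected condition in miniature (hypothetical names; the real check lives in the query pager):

```python
# Keys for the next page are recorded only when the pager will continue
# AND there is a last row to take them from: AND, not OR.

def should_update_last_keys(exhausted, partitions):
    return not exhausted and len(partitions) > 0

# Short read hit a partition end: not exhausted, but the result is empty.
assert not should_update_last_keys(exhausted=False, partitions=[])
# Normal page with data:
assert should_update_last_keys(exhausted=False, partitions=["p1"])
# An exhausted pager never updates:
assert not should_update_last_keys(exhausted=True, partitions=["p1"])
```

With the old OR, the first case would have proceeded to take back() of an empty vector.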
fixes: #7263
tests: unit(dev), but tests don't trigger this corner case
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921124329.21209-1-xemul@scylladb.com>
The .get_last_partition_and_clustering_key() method gets
the last partition from the on-board vector of partitions.
The vector in question is assumed not to be empty, but if
this assumption breaks, the result will look like memory
corruption (the docs say that calling back() on an empty
vector results in undefined behavior).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921122948.20585-1-xemul@scylladb.com>
Clang has a hard time dealing with single-element initializer lists. In this
case, adding an explicit conversion allows it to match the
initializer_list<data_value> parameter.
std::log() is not constexpr, so it cannot be assigned to a constexpr object.
Make it non-constexpr and automatic. The optimizer still figures out that it's
constant and optimizes it.
Found by clang. Apparently gcc only checks the expression is constant, not
constexpr.
Casting from the maximum int64_t to double loses precision, because
int64_t has 64 bits of precision while double has only 53. Clang
warns about it. Since it's not a real problem here, add an explicit
cast to silence the warning.
index_reader::index_bound must be constructible by non-friend classes
since it's used in std::optional (which isn't anyone's friend). This
now works in gcc because gcc's inter-template access checking is broken,
but clang correctly rejects it.
promoted_index_blocks_reader has a data member called "state", and a type member
called "state". Somehow gcc manages to disambiguate the two when used, but
clang doesn't. I believe clang is correct here, one member should subsume the other.
Change the type member to have a different name to disambiguate the two.
std::out_of_range does not have a default constructor, yet gcc somehow
accepts a no-argument construction. Clang (correctly) doesn't, so add
a parameter.
We use class template argument deduction (CTAD) in a few places, but it appears
not to work for alias templates in clang. While it looks like a clang bug, using
the class name is an improvement, so let's do that.
bool_class only has explicit conversion to bool, so an assignment such as
bool x = bool_class<foo>(true);
ought to fail. Somehow gcc allows it, but I believe clang is correct in
disallowing it.
Fix by using 'auto' to avoid the conversion.
Due to a bug, clang does not decay a type to a reference, failing
the concept evaluation on correct input. Add parentheses to force
it to decay the type.
atomic_cell_or_collection is also declared as a friend class further
down, and gcc appears to inject this friend declaration into the
global namespace. Clang appears not to, and so complains when
atomic_cell_or_collection is mentioned in the declaration of
merge_column().
Add a forward declaration in the global namespace to satisfy clang.
We call ostringstream::view(), but that member doesn't exist. It
works because it is guarded by an #ifdef and the guard isn't satisfied,
but if it is (as with clang) it doesn't compile. Remove it.
base64_chars() calls strlen() from a static_assert, but strlen() isn't
(and can't be) constexpr. gcc somehow allows it, but clang rightfully
complains.
Fix by using a character array and sizeof, instead of a pointer
and strlen().
The constraint on cryptopp_hasher<>::impl is illegal, since it's not
on the base template. Convert it to a static_assert.
We could have moved it to the base template, but that would have undone
the work to push all the implementation details in .cc and reduce #include
load.
Found by clang.
While we need CryptoPP Update() functions not to throw, they aren't
marked noexcept.
Since there is no reason for them to throw, we'll just hope they
don't, and relax the requirement.
Found by clang. Apparently gcc didn't bother to check the constraint
here.
scylla_util.py has become large; some code is unused, and some is only used
by a specific script. Drop the unnecessary parts, move non-common functions
into their caller scripts, and keep scylla_util.py simple.
Closes #7102
* syuu1228-refactor_scylla_util:
scylla_util.py: de-duplicate code on parse_scylla_dirs_with_default() and get_scylla_dirs()
scylla_util.py: remove rmtree() and redhat_version() since these are unused
dist/common/scripts: drop makedirs(), use os.makedirs()
dist/common/scripts: drop hex2list.py since it's nolonger used
dist/common/scripts: drop is_systemd() since we nolonger support non-systemd environment
dist/common/scripts: drop dist_name() and dist_ver()
dist/common/scripts: move functions that are only called from single file
It seems parse_scylla_dirs_with_default() and get_scylla_dirs() share most
of their code, so de-duplicate it.
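A minimal sketch of the de-duplication (assumed signatures and defaults; the real functions live in scylla_util.py and parse scylla.yaml):

```python
# Instead of two functions each applying the default data directory,
# have one reuse the other.

def parse_scylla_dirs_with_default(cfg):
    # Hypothetical simplification: fill in the default data directory.
    cfg.setdefault("data_file_directories", ["/var/lib/scylla/data"])
    return cfg

def get_scylla_dirs(cfg):
    # Reuse the parsing helper instead of duplicating its logic.
    return parse_scylla_dirs_with_default(cfg)["data_file_directories"]

assert get_scylla_dirs({}) == ["/var/lib/scylla/data"]
assert get_scylla_dirs({"data_file_directories": ["/d"]}) == ["/d"]
```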
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
scylla_util.py is a library for common functions across setup scripts;
it should not include private functions of a single file.
So move all those functions to their caller files.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
* seastar dc06cd1f0f...9ae33e67e1 (9):
> expiring_fifo: mark methods noexcept
> chunked_fifo: mark methods noexcept
> circular_buffer: support stateful allocators
> net/posix-stack: fix sockaddr forward reference
> rpc: fix LZ4_DECODER_RING_BUFFER_SIZE not defined
> futures_test: fix max_concurrent_for_each concepts error
> core: limit memory size for each process to 64GB
> core/reactor_backend: kill unused nr_retry
> Merge "IO tracing" from Pavel E
"
Based on https://github.com/scylladb/scylla/issues/7199,
it looks like storage_service::set_tables_autocompaction
may be called on shards other than 0.
Use run_with_api_lock to both serialize the action
and to check _initialized on shard 0.
Fixes #7199
Test: unit(dev), compaction_test:TestCompaction_with_SizeTieredCompactionStrategy.disable_autocompaction_nodetool_test
"
* tag 'set_tables_autocompaction-v1' of github.com:bhalevy/scylla:
storage_service: set_tables_autocompaction: fixup indentation
storage_service: set_tables_autocompaction: run_with_api_lock
storage_service: set_tables_autocompaction: use do_with to hold on to args
storage_service: set_tables_autocompaction: log message in info level
We need to skip internet access on offline installation.
To do this we need the following changes:
- prevent running yum/apt for each script
- set default "NO" for scripts that require package installation
- set default "NO" for scripts that require internet access, such as NTP
See #7153
Closes #7224
* syuu1228-offline_setup:
dist/common/scripts: skip internet access on offline installation
scylla_ntp_setup: use shutil.which() to look up commands
- Add necessary changes to `scylla_util.py` in order to support Scylla Machine Image in GCE.
- Fixes and improvements for `curl` function.
Closes#7080
* bentsi-gce-image-support:
scylla_util.py: added GCE instance/image support
scylla_util.py: make max_retries a curl function argument
scylla_util.py: adding timeout to curl function
scylla_util.py: styling fixes to curl function
scylla_util.py: change default value for headers argument in curl function
Let's build scylla-unified-package.tar.gz in build/<mode>/dist/tar for
symmetry. The old location is still kept for backward compatibility for
now. Also document the new official artifact location.
Message-Id: <20200917071131.126098-1-penberg@scylladb.com>
"
... and make sure nothing is left. With the help of fresh seastar
this can be done quickly.
Before doing this check -- unregister remaining verbs in repair and
storage_service and fix tests not to register verbs, because they
are all local.
tests: unit(dev), manual
"
* 'br-messaging-service-stop-all' of https://github.com/xemul/scylla:
messaging_service: Report still registered services as errors
repair: Move CHECKSUM_RANGE verb into repair/
repair: Toss messaging init/uninit calls
storage_service: Uninit RPC verbs
test: Do not init messaging verbs
On stop -- unregister the CLIENT_ID verb, which is registered
in constructor, then check for any remaining ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The verb is sent by repair code, so it should be registered
in the same place, not in main. Also -- the verb should be
unregistered on stop.
The global messaging service instance is made similarly to the
row-level one, as there's no ready to use repair service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make it possible to reg/unreg not only row-level
verbs. While at it -- equip the init call with a sharded<database>&
argument; it will be needed by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds a "--list" option to test.py that shows a list of tests
instead of executing them. This is useful for people and scripts, which
need to discover the tests that will be run. For example, when Jenkins
needs to store failing tests, it can use "test.py --list" to figure out
what to archive.
Message-Id: <20200916135714.89350-1-penberg@scylladb.com>
And move-assign to _pending_ranges_interval_map[keyspace_name]
only when done.
This is more efficient since there's no need to look up
_pending_ranges_interval_map[keyspace_name] for every insert to the
interval_map.
And it is exception safe in case we run out of memory mid-way.
Refs #7220
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200916115059.788606-1-bhalevy@scylladb.com>
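The pattern described above can be sketched in Python (illustrative names; the real code move-assigns a boost interval_map in C++): accumulate into a local object, then install it with a single assignment, so a mid-way failure leaves the shared map untouched and each insert avoids a per-key lookup.

```python
def rebuild_pending_ranges(pending_map, keyspace, ranges):
    # Accumulate into a local container; pending_map[keyspace] is
    # looked up and written exactly once, at the end.
    local = []
    for r in ranges:               # may raise mid-way;
        local.append(r)            # pending_map stays untouched
    pending_map[keyspace] = local  # single assignment when done

def failing_ranges():
    yield "r1"
    raise MemoryError("simulated out-of-memory mid-way")

pending = {"ks1": ["old"]}
try:
    rebuild_pending_ranges(pending, "ks1", failing_ranges())
except MemoryError:
    pass
assert pending["ks1"] == ["old"]          # old state intact: exception safe
rebuild_pending_ranges(pending, "ks1", ["r1", "r2"])
assert pending["ks1"] == ["r1", "r2"]
```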
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the name of other attributes replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).
To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the change proposed by the previous call was not accepted by
the Paxos protocol.
The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from recurring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.
The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().
The const apply() means that the compiler verifies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.
Fixes #7218.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
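The move-vs-copy bug can be illustrated with a small Python analogue (hypothetical class name; the real code is C++ and uses std::move):

```python
class PutItemOp:
    def __init__(self, item):
        self._item = item

    def apply_buggy(self):
        # "Moves" the saved item into the result; a second call,
        # triggered by Paxos contention, sees an emptied operation.
        item, self._item = self._item, {}
        return item

    def apply_fixed(self):
        # Copy instead: apply() can safely run any number of times.
        return dict(self._item)

op = PutItemOp({"name": "x", "value": 1})
assert op.apply_buggy() == {"name": "x", "value": 1}
assert op.apply_buggy() == {}              # data lost on the retry

op = PutItemOp({"name": "x", "value": 1})
assert op.apply_fixed() == {"name": "x", "value": 1}
assert op.apply_fixed() == {"name": "x", "value": 1}  # retry is safe
```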
Add a "Closes #$PR_NUM" annotation at the end of the commit
message to tell github to close the pull request, preventing
manual work and/or dangling pull requests.
Closes #7245
"
This series follows the suggestion from https://github.com/scylladb/scylla/pull/7203#issuecomment-689499773 discussion and deprecates a number of cluster features. The deprecation does not remove any features from the strings sent via gossip to other nodes, but it removes all checks for these features from code, assuming that the checks are always true. This assumption is quite safe for features introduced over 2 years ago, because the official upgrade path only allows upgrading from a previous official release, and these feature bits were introduced many release cycles ago.
All deprecated features were picked from a `git blame` output which indicated that they come from 2018:
```git
e46537b7d3 2016-05-31 11:44:17 +0200 RANGE_TOMBSTONES_FEATURE = "RANGE_TOMBSTONES";
85c092c56c 2016-07-11 10:59:40 +0100 LARGE_PARTITIONS_FEATURE = "LARGE_PARTITIONS";
02bc0d2ab3 2016-12-09 22:09:30 +0100 MATERIALIZED_VIEWS_FEATURE = "MATERIALIZED_VIEWS";
67ca6959bd 2017-01-30 19:50:13 +0000 COUNTERS_FEATURE = "COUNTERS";
815c91a1b8 2017-04-12 10:14:38 +0300 INDEXES_FEATURE = "INDEXES";
d2a2a6d471 2017-08-03 10:53:22 +0300 DIGEST_MULTIPARTITION_READ_FEATURE = "DIGEST_MULTIPARTITION_READ";
ecd2bf128b 2017-09-01 09:55:02 +0100 CORRECT_COUNTER_ORDER_FEATURE = "CORRECT_COUNTER_ORDER";
713d75fd51 2017-09-14 19:15:41 +0200 SCHEMA_TABLES_V3 = "SCHEMA_TABLES_V3";
2f513514cc 2017-11-29 11:57:09 +0000 CORRECT_NON_COMPOUND_RANGE_TOMBSTONES = "CORRECT_NON_COMPOUND_RANGE_TOMBSTONES";
0be3bd383b 2017-12-04 13:55:36 +0200 WRITE_FAILURE_REPLY_FEATURE = "WRITE_FAILURE_REPLY";
0bab3e59c2 2017-11-30 00:16:34 +0000 XXHASH_FEATURE = "XXHASH";
fbc97626c4 2018-01-14 21:28:58 -0500 ROLES_FEATURE = "ROLES";
802be72ca6 2018-03-18 06:25:52 +0100 LA_SSTABLE_FEATURE = "LA_SSTABLE_FORMAT";
71e22fe981 2018-05-25 10:37:54 +0800 STREAM_WITH_RPC_STREAM = "STREAM_WITH_RPC_STREAM";
```
Tests: unit(dev)
manual(verifying with cqlsh that the feature strings are indeed still set)
"
Closes #7234.
* psarna-clean_up_features:
gms: add comments for deprecated features
gms: remove unused feature bits
streaming: drop checks for RPC stream support
roles: drop checks for roles schema support
service: drop checks for xxhash support
service: drop checks for write failure reply support
sstables: drop checks for non-compound range tombstones support
service: drop checks for v3 schema support
repair: drop checks for large partitions support
service: drop checks for digest multipartition read support
sstables: drop checks for correct counter order support
cql3: drop checks for materialized views support
cql3: drop checks for counters support
cql3: drop checks for indexing support
We need to skip internet access on offline installation.
To do this we need the following changes:
- prevent running yum/apt for each script
- set default "NO" for scripts that require package installation
- set default "NO" for scripts that require internet access, such as NTP
See #7153
Fixes #7182
The directory where a command is installed may differ between distributions;
we can abstract the difference using shutil.which().
Also, the script becomes simpler than passing full paths to os.path.exists().
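A minimal sketch of the difference (the command names are just examples):

```python
import os.path
import shutil

# Fragile: hard-codes one distribution's install location.
def have_cmd_oldstyle(path):
    return os.path.exists(path)

# Portable: searches PATH, wherever the distro installed the command.
def have_cmd(name):
    return shutil.which(name) is not None

assert have_cmd("sh")                       # present on any POSIX system
assert not have_cmd("no-such-command-xyz")  # cleanly reports absence
```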
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.
During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accommodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.
Fix by making _result the last member, so it is freed first.
Fixes #7240.
Introduce new database config option `schema_registry_grace_period`
describing the amount of time in seconds after which unused schema
versions will be cleaned up from the schema registry cache.
Default value is 1 second, the same value as was hardcoded before.
Tests: unit(debug)
Refs: #7225
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200915131957.446455-1-pa.solodovnikov@scylladb.com>
The set contains 3 small optimizations:
- avoid copying of partition key on lookup path
- reduce number of args carried around when creating a new entry
- save one partition key comparison on reader creation
Plus related satellite cleanups.
* https://github.com/xemul/scylla/tree/br-row-cache-less-copies:
row_cache: Revive do_find_or_create_entry concepts
populating reader: Do not copy decorated key too early
populating reader: Less allocator switching on population
populating reader: Fix indentation after previous patch
row_cache: Move missing entry creation into helper
test: Lookup an existing entry with its own helper
row_cache: Do not copy partition tombstone when creating cache entry
row_cache: Kill incomplete_tag
row_cache: Save one key compare on direct hit
"
This PR removes _pending_ranges and _pending_ranges_map in token_metadata.
This removal makes copying of token_metadata faster and reduces the chance of causing a reactor stall.
Refs: #7220
"
* asias-token_metadata_replication_config_less_maps:
token_metadata: Remove _pending_ranges
token_metadata: Get rid of unused _pending_ranges_map
from Asias.
This series follows "repair: Add progress metrics for node ops #6842"
and adds the metrics for the remaining node operations,
i.e., replace, decommission and removenode.
Fixes #1244, #6733
* asias-repair_progress_metrics_replace_decomm_removenode:
repair: Add progress metrics for removenode ops
repair: Add progress metrics for decommission ops
repair: Add progress metrics for replace ops
Change 94995acedb added yielding to abstract_replication_strategy::do_get_ranges.
And 07e253542d used get_ranges_in_thread in compaction_manager.
However, there is nothing to prevent token_metadata, and in particular its
`_sorted_tokens` from changing while iterating over them in do_get_ranges if the latter yields.
Therefore copy the replication strategy's `_token_metadata` in `get_ranges_in_thread(inet_address ep)`.
If the caller provides `token_metadata` to get_ranges_in_thread, then the caller
must make sure that we can safely yield while accessing token_metadata (like
in `do_rebuild_replace_with_repair`).
Fixes #7044
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200915074555.431088-1-bhalevy@scylladb.com>
- Remove get_pending_ranges and introduce has_pending_ranges, since the
caller only needs to know if there is a pending range for the keyspace
and the node.
- Remove print_pending_ranges which is only used in logging. If we
really want to log the new pending token ranges, we can log when we
set the new pending token ranges.
This removal of _pending_ranges makes copying of token_metadata faster
and reduces the chance of causing a reactor stall.
Refs: #7220
"
The range_tombstone_list provides an abstraction for working with a
sorted list of range tombstones, with methods to add/retrieve
them. However, there's a tombstones() method that just returns a
modifiable reference to the underlying collection (boost::intrusive_set),
which makes it hard to track its exact usage.
This set encapsulates the collection of range tombstones inside
the mentioned ..._list class.
tests: unit(dev)
"
* 'br-range-tombstone-encapsulate-collection' of https://github.com/xemul/scylla:
range_tombstone_list: Do not expose internal collection
range_tombstone_list: Introduce and use pop-and-lock helper
range_tombstone_list: Introduce and use pop_as<>()
flat_mutation_reader: Use range_tombstone_list begin/end API
repair: Mark some partition_hasher methods noexcept
hashers: Mark hash updates noexcept
The histogram constructor has a `counts` parameter defaulted to
`defaultdict(int)`. Due to how default argument values work in
python -- the same value is passed to all invocations -- this results in
all histogram instances sharing the same underlying counts dict. Solve
it the way this is usually solved -- default the parameter to `None` and
when it is `None` create a new instance of `defaultdict(int)` local to
the histogram instance under construction.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200908142355.1263568-1-bdenes@scylladb.com>
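The bug boils down to Python evaluating default argument values once, at function definition time; a minimal sketch (the histogram classes here are stand-ins for the real one):

```python
from collections import defaultdict

class BuggyHistogram:
    # The defaultdict is created once, when 'def' executes,
    # and is then shared by every instance.
    def __init__(self, counts=defaultdict(int)):
        self.counts = counts

class FixedHistogram:
    def __init__(self, counts=None):
        # Default to None and create a fresh dict per instance.
        self.counts = counts if counts is not None else defaultdict(int)

a, b = BuggyHistogram(), BuggyHistogram()
a.counts["bucket"] += 1
assert b.counts["bucket"] == 1     # b sees a's update: shared state

c, d = FixedHistogram(), FixedHistogram()
c.counts["bucket"] += 1
assert d.counts["bucket"] == 0     # instances are independent
```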
Currently blobs are converted to python bytes objects and printed by
simply converting them to string. This results in hard-to-read blobs, as
the bytes' __str__() attempts to interpret the data as a printable
string. This patch changes this to use bytes.hex() which prints blobs in
hex format. This is much more readable and it is also the format that
scylla uses when printing blobs.
Also the conversion to bytes is made more efficient by using gdb's
gdb.inferior.read_memory() function to read the data.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911085439.1461882-1-bdenes@scylladb.com>
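The difference is easy to see in plain Python:

```python
blob = bytes([0x0d, 0xf0, 0xad, 0xba])

# str() falls back to repr(), mixing escapes with any printable bytes:
assert str(blob) == r"b'\r\xf0\xad\xba'"

# hex() gives the compact form scylla itself uses when printing blobs:
assert blob.hex() == "0df0adba"
```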
The patch extracts a little helper function that populates
a schema_mutation with various column information.
Will be used in a subsequent patch to serialize column mappings.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This helper function is similar to the ordinary `unfreeze` of
`frozen_mutation` but in addition to the schema_ptr supplies a
custom column_mapping which is being used when upgrading the
mutation.
Needed for a subsequent patch regarding column mappings history.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The hugepages and libhugetlbfs-bin packages are only required for DPDK mode,
and installing them unconditionally causes errors in offline mode, so drop them.
Fixes #7182
data::cell targets 8KB as its maximum allocations size to avoid
pressuring the allocator. This 8KB target is used for internal storage
-- values small enough to be stored inside the cell itself -- as well
for external storage. Externally stored values use 8KB fragment sizes.
The problem is that only the size of data itself was considered when
making the allocations. For example when allocating the fragments
(chunks) for external storage, each fragment stored 8KB of data. But
fragments have overhead, they have next and back pointers. This resulted
in an 8KB + 2 * sizeof(void*) allocation. IMR uses the allocation
strategy mechanism, which works with aligned allocations. As the seastar
allocator only guarantees aligned allocations for power-of-two sizes,
it ends up allocating a 16KB slot. This results in the mutation fragment
using almost twice as much memory as would be required. This is a huge
waste.
This patch fixes the problem by considering the overhead of both
internal and external storage ensuring allocations are 8KB or less.
Fixes: #6043
Tests: unit(debug, dev, release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200910171359.1438029-1-bdenes@scylladb.com>
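The arithmetic behind the waste can be sketched in Python (a simplified model of the allocator's power-of-two rounding; the 16-byte overhead stands for the two list-hook pointers):

```python
def aligned_slot(size):
    # Aligned allocations are only guaranteed for power-of-two sizes,
    # so a request is effectively rounded up to the next power of two.
    return 1 << (size - 1).bit_length()

DATA = 8192          # 8KB of cell data per external-storage fragment
OVERHEAD = 2 * 8     # next/back pointers of the fragment

# Before the fix: 8KB of data + pointers overflows into a 16KB slot,
# nearly doubling the memory used per fragment.
assert aligned_slot(DATA + OVERHEAD) == 16384

# After the fix: shrink the data so data + overhead fits in 8KB exactly.
assert aligned_slot((DATA - OVERHEAD) + OVERHEAD) == 8192
```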
Instead of using the default hasher, hashing specializations should
use the hasher type they were specialized for. It's not a correctness
issue now because the default hasher (xx_hasher) is compatible with
its predecessor (legacy_xx_hasher_without_null_digest), but it's better
to be future-proof and use the correct type in case we ever change the
default hasher in a backward-incompatible way.
Message-Id: <c84ce569d12d9b4f247fb2717efa10dc2dabd75b.1600074632.git.sarna@scylladb.com>
Based on https://github.com/scylladb/scylla/issues/7199,
it looks like storage_service::set_tables_autocompaction
may be called on shards other than 0.
Use run_with_api_lock to both serialize the action
and to check _initialized on shard 0.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation to calling is_initialized() which may yield.
Plus, the way the tables vector is currently captured is inefficient.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Checks for features introduced over 2 years ago were removed
in previous commits, so all that is left is removing the feature
bits itself. Note that the feature strings are still sent
to other nodes just to be double sure, but the new code assumes
that all these features are implicitly enabled.
Streaming with RPC stream is supported for over 2 years and upgrades
are only allowed from versions which already have the support,
so the checks are hereby dropped.
The reader fills up its buffer upon construction, which is not what other
readers do, and is considered a waste of cycles, as the reader can be
dropped early.
Refs #1671
test: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200910171134.11287-2-xemul@scylladb.com>
xxhash algorithm is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Write failure reply is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Correct non-compound range tombstones are supported for over 2 years
and upgrades are only allowed from versions which already have the
support, so the checks are hereby dropped.
Large partitions are supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Digest multipartition read is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
Correct counter order is supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
"
Migration manager installs several cluster feature change listeners.
The listeners will call update_schema_version_and_announce() when cluster
features are enabled, which does this:
return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
return announce_schema_version(uuid);
});
It first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.
The fix is to serialize schema digest calculation and publishing.
Refs #7200
This problem also drew my attention to the initialization code, which could be
prone to the same problem.
The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.
First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It is not mutually exclusive with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.
Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.
Maybe we're not doing concurrent schema changes/recalculations now,
but it is easy to imagine that this could change for whatever reason
in the future.
The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.
Tests:
- unit (dev)
- manual (3 node ccm)
"
* tag 'fix-schema-digest-calculation-race-v1' of github.com:tgrabiec/scylla:
db, schema: Hide update_schema_version_and_announce()
db, storage_service: Do not call into gossiper from the database layer
db: Make schema version observable
utils: updateable_value_source: Introduce as_observable()
schema: Fix race in schema version recalculation leading to stale schema version in gossip
"
This pull request fixes unified relocatable package dependency issues in
other build modes than release, and then adds unified tarball to the
"dist" build target.
Fixes#6949
"
* 'penberg/build/unified-to-dist/v1' of github.com:penberg/scylla:
configure.py: Build unified tarball as part of "dist" target
unified/build_unified: Use build/<mode>/dist/tar for dependency tarballs
configure.py: Use build/<mode>/dist/tar for unified tarball dependencies
This patch is a proposal for the removal of the redis
classes describing the commands. 'prepare' and 'execute'
class functions have been merged into a function with
the name of the command.
Note: 'command_factory' still needs to be simplified.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200913183315.9437-1-etienne.adam@gmail.com>
it's failing like so:
Python Exception <class 'TypeError'> unsupported operand type(s) for +: 'int' and 'str':
it's a regression caused by e4d06a3bbf.
_mask() should use the ref stored in the ctor to dereference _impl.
Fixes #7058.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200908154342.26264-1-raphaelsc@scylladb.com>
Define container types near the containing elements' hook
members, so that they could be private without the need
to friend classes with each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_tracker tries to access a private member of the
rows_entry to unlink it, but the lru_type is auto_unlink
and can unlink itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.
First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It is not mutually exclusive with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.
Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.
The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.
This also allows us to drop unsafe functions like update_schema_version().
Migration manager installs several feature change listeners:
if (this_shard_id() == 0) {
_feature_listeners.push_back(_feat.cluster_supports_view_virtual_columns().when_enabled(update_schema));
_feature_listeners.push_back(_feat.cluster_supports_digest_insensitive_to_expiry().when_enabled(update_schema));
_feature_listeners.push_back(_feat.cluster_supports_cdc().when_enabled(update_schema));
_feature_listeners.push_back(_feat.cluster_supports_per_table_partitioners().when_enabled(update_schema));
}
They will call update_schema_version_and_announce() when features are enabled, which does this:
return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
return announce_schema_version(uuid);
});
So it first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.
The fix is to serialize schema digest calculation and publishing.
Refs #7200
It causes gdb to print UUIDs like this:
"a3eadd80-f2a7-11ea-853c-", '0' <repeats 11 times>, "4"
This is quite hard to read, let's drop the string display hint, so they
are displayed like this:
a3eadd80-f2a7-11ea-853c-000000000004
Much better. Also technically UUID is a 128 bit integer anyway, not a
string.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911090135.1463099-1-bdenes@scylladb.com>
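What the printer produces once the display hint is gone can be mimicked with the stdlib, treating the UUID as the 128-bit integer it is (the value below is the one from the example above):

```python
import uuid

# The 128-bit value gdb would read from the UUID's two 64-bit halves.
val = 0xa3eadd80f2a711ea853c000000000004

# Formatting it as an integer-backed UUID yields the readable form:
assert str(uuid.UUID(int=val)) == "a3eadd80-f2a7-11ea-853c-000000000004"
```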
The build_unified.sh script has the same bug as configure.py had: it
looks for the python tarball in
build/<mode>/scylla-python3-package.tar.gz, but it's never generated
there. Fix up the problem by using build/<mode>/dist/tar location for
all dependency tarballs.
The build target for scylla-unified-package.tar.gz incorrectly depends
on "build/<mode>/scylla-python3-package.tar.gz", which is never
generated. Instead, the package is either generated in
"build/release/scylla-python3-package.tar.gz" (for legacy reasons) or
"build/<mode>/dist/tar/scylla-python3-package.tar.gz". This issue
causes building the unified package in other modes to fail.
To solve the problem, let's switch to using the "build/<mode>/dist/tar"
locations for unified tarball dependencies, which is the correct place
to use anyway.
When the test suite is run with Scylla serving in HTTPS mode, using
test/alternator/run --https, two Alternator Streams tests failed.
With this patch fixing a bug in the test, the tests pass.
The bug was in the is_local_java() function which was supposed to detect
DynamoDB Local (which behaves in some things differently from the real
DynamoDB). When that detection code makes an HTTPS request and does not
disable checking the server's certificate (which on Alternator is
self-signed), the request fails - but not in the way that the code expected.
So we need to fix is_local_java() to allow the failure mode of the
self-signed certificate. Anyway, this case is *not* DynamoDB Local so
the detection function would return false.
Fixes #7214
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200910194738.125263-1-nyh@scylladb.com>
This patch adds regression tests for four recently-fixed issues which did not yet
have tests:
Refs #7157 (LatestStreamArn)
Refs #7158 (SequenceNumber should be numeric)
Refs #7162 (LatestStreamLabel)
Refs #7163 (StreamSpecification)
I verified that all the new tests failed before these issues were fixed, but
now pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200907155334.562844-1-nyh@scylladb.com>
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.
Fixes #4567
Based on #4574
"
* psarna-fix_ignoring_cells_after_null_in_appending_hash:
test: extend mutation_test for NULL values
tests/mutation: add reproducer for #4567
gms: add a cluster feature for fixed hashing
digest: add null values to row digest
mutation_partition: fix formatting
appending_hash<row>: make publicly visible
The test is extended for another possible corner case:
[1, NULL, 2] vs [1, 2, NULL] should have different digests.
Also, a check for legacy behavior is added.
With the new hashing routine, null values are taken into account
when computing row digest. Previous behavior had a regression
which stopped computing the hash after the first null value
was encountered, but the original behavior was also prone
to errors - e.g. row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
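A toy model of the three behaviors (md5 stands in for the real hasher, and the cell encoding is simplified):

```python
import hashlib

def digest(cells, mode):
    h = hashlib.md5()
    for c in cells:
        if c is None:
            if mode == "regression":
                break                   # bug: stop at the first NULL
            if mode == "fixed":
                h.update(b"\x00null")   # feed a marker so position counts
            # mode == "legacy": silently skip NULLs
        else:
            h.update(repr(c).encode())
    return h.hexdigest()

# Legacy: NULL position is invisible, so the two rows collide.
assert digest([1, None, 2], "legacy") == digest([1, 2, None], "legacy")
# Regression: everything after the first NULL was ignored as well.
assert digest([1, None, 2], "regression") == digest([1, None, 99], "regression")
# Fixed: NULLs participate, so the rows get distinct digests.
assert digest([1, None, 2], "fixed") != digest([1, 2, None], "fixed")
```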
appending_hash<row> specialisation is declared and defined in a *.cc file
which means it cannot have a dedicated unit test. This patch moves the
declaration to the corresponding *.hh file.
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing. Add a test exposing the NULL
dereference and fix the typo.
Tests: unit (dev)
Fixes #7198.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
In commit 7d86a3b208 (storage_service:
Make replacing node take writes), application state of TOKENS of the
replacing node is added into gossip and propagated to the cluster after
the initial start of gossip service. This can cause a race below
1. The replacing node replaces the old dead node with the same ip address
2. The replacing node starts gossip without application state of the TOKENS
3. Other nodes in the cluster replace the application states of old dead node's
version with the new replacing node's version
4. replacing node dies
5. replace operation is performed again; the TOKENS application state is
not present and the replace operation fails.
To fix, we can always add TOKENS application state when the
gossip service starts.
Fixes: #7166
Backports: 4.1 and 4.2
The clustering_row class looks like a decorated deletable_row, but
re-implements all its logic (and members). Embed the deletable_row
into clustering_row and keep the non-static row logic in one
class instead of two.
We don't want to update the scylla-python3 submodule for every python3 dependency
update, so bring the python3 package list into python3-dependencies.txt and pass it
at package-building time.
See #6702
See scylladb/scylla-python3#6
[avi: add
* tools/python3 19a9cd3...b4e52ee (1):
> Allow specify package dependency list by --packages
to maintain bisectability]
* seastar 4ff91c4c3a...52f0f38994 (4):
> rpc: Return actual chosen compressor in server response - not all avail
Fixes #6925.
> net/tls: fix compilation guards around sec_param().
> future: improved printing of seastar::nested_exception
> merge: http: fix issues with the request parser and testing
When configuration files for perftune contain an invalid parameter, scylla_prepare
may produce a traceback because error handling is not enough.
Throw all errors from create_perftune_conf(), catch them in scylla_prepare,
and print a user-understandable error.
Fixes #6847
The clustering_row is deletable_row + clustering_key, all
its internals work exactly as the relevant deletable_row's
ones.
A similar relation holds between static_row and row, where
the former wraps the latter, so here's the same trick
for the non-static row classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The deletable_row accepts clustering_row in constructor and
.apply() method. The next patch will make clustering_row
embed the deletable_row inside, so those two methods will
violate layering and should be fixed in advance.
The fix is in providing a clustering_row method to convert
itself into a deletable_row. There are two places that need
this: mutation_fragment_applier and partition_snapshot_row_cursor.
Both methods pass temporary clustering_row value, so the
method in question is also move-converter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
check_and_repair_cdc_streams, in case it decides to create a new CDC
generation, updates the STATUS application state so that other nodes
gossiped with pick up the generation change.
The node which runs check_and_repair_cdc_streams also learns about the
generation change: the STATUS update causes a change notification.
This happens during the add_local_application_state call
which caused the STATUS update; it leads to calling
handle_cdc_generation, which detects a generation change and calls
add_local_application_state with the new generation's timestamp.
Thus, we get a recursive add_local_application_state call. Unfortunately,
the function takes a lock before doing on_change notifications, so we
get a deadlock.
This commit prevents the deadlock.
We update the local variable which stores the generation timestamp
before updating STATUS, so handle_cdc_generation won't consider
the observed generation to be new, hence it won't perform the recursive
add_local_application_state call.
Hint writes are handled by storage_proxy in the exact same way
regular writes are, which in turn means that the same smp service
group is used for both. The problem is that this can lead to a priority
inversion where writes of the lower-priority kind occupy a lot of
the semaphore's units, making the higher-priority writes wait for an
empty slot.
This series adds a separate smp group for hints as well as a field
to pass the correct smp group to mutate_locally functions, and
then uses this field to properly classify the writes.
Fixes #7177
* eliransin-hint_priority_inversion:
Storage proxy: use hints smp group in mutate locally
Storage proxy: add a dedicated smp group for hints
Changing a user type may allow adding apparently duplicate rows to
tables where this type is used in a partitioning key. Fix by checking
all types of existing partitioning columns before allowing new
fields to be added to the type.
Fixes #6941
The log-structured allocator (LSA) reserves memory when performing
operations, since its operations are performed with reclaiming disabled
and if it runs out, it cannot evict cache to gain more. The amount of
memory to reserve is remembered across calls so that it does not have
to repeat the fail/increase-reserve/retry cycle for every operation.
However, we currently never decay the reserve amount. This means
that if a single operation increased the reserve in the distant past,
all current operations also require this large reserve. Large reserves
are expensive since they can cause large amounts of cache to be evicted.
This patch adds reserve decay. The time-to-decay is inversely proportional
to reserve size: 10GB/reserve. This means that a 20MB reserve is halved
after 500 operations (10GB/20MB) while a 20kB reserve is halved after
500,000 operations (10GB/20kB). So large, expensive reserves are decayed
quickly while small, inexpensive reserves are decayed slowly to reduce
the risk of allocation failures and exceptions.
A unit test is added.
Fixes #325.
This patch adds support for 2 hash commands HDEL and HGETALL.
Internally it introduces the hashes_result_builder class to
read hashes and store them in a std::map.
Other changes:
- one exception return string was fixed
- tests now use pytest.raises
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200907202528.4985-1-etienne.adam@gmail.com>
We are using mutate_locally to handle hint mutations that arrived
through RPC. The current implementation makes no distinction whether
the mutation came through the hint verb or a mutation verb, resulting in
using the same smp group for both. This commit adds the ability to
reference a different smp group in mutate_locally private calls and
makes the handlers pass the correct smp group to mutate_locally.
Due to #7175, microseconds are stored in a db_clock::time_point
as if they were milliseconds.
std::chrono::duration_cast<std::chrono::nanoseconds> may cause overflow
and end up with invalid/negative nanos.
This change specializes time_point_to_string to std::chrono::milliseconds
since it's currently only called to print db_clock::time_point
and uses boost::posix_time::milliseconds to print the count.
With today's timestamps this generates an exception
and the output will look like:
1599493018559873 milliseconds (Year is out of valid range: 1400..9999)
instead of:
1799-07-16T19:57:52.175010
It is preferable to print the numeric value annotated as out of the
valid range than to print a bogus date in the past.
Test: unit(dev), commitlog_test:TestCommitLog.test_mixed_mode_commitlog_same_partition_smp_1
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200907162845.147477-1-bhalevy@scylladb.com>
Now that all work with the list is expressed as API calls, it's
finally possible to stop exposing the boost::set outside
the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's an optimization in flat_mutation_reader_from_mutations
that folds the list from left-to-right in linear time. In the case
of the currently used boost::set, the .unlink_leftmost_without_rebalance
helper is used, so wrap this exception with a method of the
range_tombstone_list. This is the last place where a caller needs
to mess with the exact internal collection.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method extracts an element from the list, constructs
the desired object from it and frees it. This is a common usage
of range_tombstone_list. Having a helper helps encapsulate
the exact collection inside the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to stop revealing the exact collection from
the range_tombstone_list, so make use of existing begin/end
methods and extend with rbegin() where needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The next patch will change the way range tombstones are
fed into the hasher. To make sure the code flow doesn't
become exception-unsafe, mark the relevant methods as
non-throwing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All those methods end up in library calls whose code
is not marked noexcept, but is effectively non-throwing according
to the code itself or the docs.
The primary goal is to make some repair partition_hasher
methods noexcept (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/7160
By Calle Wilund:
Stream descriptions, as returned from create, update and describe stream, were missing the "latest" stream arn.
Shard description sequence numbers (for us, timeuuids) were formatted wrongly. The spec states they should be numeric only.
Both these defects break Kinesis operations.
alternator: Set CDC delta to keys only for alternator streams
alternator: Include stream spec in desc for create/update/describe
alternator: Include LatestStreamLabel in resulting desc for create/update table
alternator: Make "StreamLabel" an iso8601 timestamp
alternator: Alloc BILLING_MODE in update_table
cdc: Add setter for delta mode
alternator: Fix sequence number range using wrong format
alternator: Include stream arn in table description if enabled
Add a new validate_with_error_position function
which returns -1 if the data is a valid UTF-8 string,
or otherwise the byte position of the first invalid
character. The position is added to the exception
messages of all UTF-8 parsing errors in Scylla.
validate_with_error_position is done in two
passes in order to preserve the same performance
in the common case when the string is valid.
Fixes #7190
Since we don't use any delta value when translating cdc -> streams
it is wasteful to write these to the log table, esp. since we already
write big fat pre- and post images.
Fixes #7163
If enabled, the resulting table description should include a
StreamDescription object with the appropriate members describing
current stream settings.
Previous patches removed those `BOOST_REQUIRE*` macros that could be
invoked from shards other than 0. The reason is that said macros are not
thread-safe, so calling them from multiple shards produces mangled
output to stdout as well as the XML report file. It was assumed that
only these invocations -- from a non-0 shard -- are problematic, but it
turns out even these can race with seastar log messages emitted from
other shards. This patch removes all such macros, replacing them with
the thread safe `require*` functions from `test/lib/test_utils.hh`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200907125309.1199104-1-bdenes@scylladb.com>
Hints and regular writes currently use the same cross-shard
operation semaphore, which can lead to priority inversion, making
cross-shard writes wait for cross-shard hints. This commit adds
an smp_service_group for hints and uses it in the mutate_hint
function.
Fixes #7158
A stream's shard description has a sequence range describing the
start/end (if available) of the shard. This is specified as being
"numeric only". Alternator incorrectly used a UUID here, which breaks
kinesis.
v2:
* Fix uint128_t parsing from string. The bmp::number constructor accepted
sstring, but did not interpret it as std::string/chars, with weird results.
As seen in https://github.com/scylladb/scylla/issues/7175,
1e676cd845
that was merged in bc77939ada
exposed a preexisting problem in time_point_to_string
where it tried printing a timestamp that was in microseconds
(taken from an api::timestamp_type instead of db_clock::time_point)
and hit `boost::wrapexcept<boost::gregorian::bad_year> (Year is out of valid range: 1400..9999)`
If hit, this patch will print the offending time_stamp in
nanoseconds and the error message.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200907083303.33229-1-bhalevy@scylladb.com>
Fixes #7157
When creating/altering/describing a table, if streams are enabled, the
"latest active" stream arn should be included as LatestStreamArn.
Not doing so breaks java kinesis.
This improves the build documentation beyond just packaging:
- Explain how the configure.py step works
- Explain how to build just executables and tests (for development)
- Explain how to build for specific build mode if you didn't specify a
build mode in configure.py step
- Fix build artifact locations, missing .debs, and add executables and
tests
Message-Id: <20200904084443.495137-1-penberg@iki.fi>
The time_point tp may have a period other than milliseconds.
Cast the time_point duration to nanoseconds (or microseconds
if boost doesn't support it) so it is printed in the best possible
resolution.
Note that we presume that the time_point epoch is the
Unix epoch of 1970-01-01, but the c++ standard doesn't guarantee
that. See https://github.com/scylladb/scylla/issues/5498
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200906171106.690872-1-bhalevy@scylladb.com>
Documentation states that `SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK`
is a 32-bit integer that represents bit mask. What it fails to mention
is that it's an unsigned value and in fact it takes the value 2147483648.
This is problematic for clients in languages that don't have unsigned
types (like Java).
This patch improves the documentation to make it clear that
`SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK` is represented by unsigned
value.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <7166b736461ae6f3d8ffdf5733e810a82aa02abc.1599382184.git.piotr@scylladb.com>
statement_restrictions::process_partition_key_restrictions() was
checking has_unrestricted_components(), whereas just an empty() check
suffices there, because has_unrestricted_components() is implicitly
checked five lines down by needs_filtering().
The replacement check is cheaper and simpler to understand.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously batch statement result set included rows for only
those updates which have a prefetch data present (i.e. there
was an "old" (pre-existing) row for a key).
Also, these rows were sorted not in the order in which statements
appear in the batch, but in the order of updated clustering keys.
If we have a batch which updates a few non-existent keys, then
it's impossible to figure out which update inserted a new key
by looking at the query response. Not only because the responses
may not correspond to the order of statements in the batch, but
even some rows may not show up in the result set at all.
Please see #7113 on Github for detailed description
of the problem:
https://github.com/scylladb/scylla/issues/7113
The patch set proposes the following fix:
For conditional batch statements the result set now always
includes a row for each LWT statement, in the same order
in which individual statements appear in the batch.
This way we can always tell which update did actually insert
a new key or update the existing one.
Technically, the following changes were made:
* `update_parameters::prefetch_data::row::is_in_cas_result_set`
member removed as well as the supporting code in
`cas_request::applies_to` which iterated through cas updates
and marked individual `prefetch_data` rows as "need to be in
cas result set".
* `cas_request::applies_to` substantially simplified since it
  doesn't do anything more than checking `stmt.applies_to()`
  in a short-circuiting manner.
* `modification_statement::build_cas_result_set` method moved
  to `cas_request`. This makes it easy to iterate through
  individual `cas_row_update` instances and preserve the order
  of the rows in the result set.
* A little helper `cas_request::find_old_row`
is introduced to find a row in `prefetch_data` based on the
(pk, ck) combination obtained from the current `cas_request`
and a given `cas_row_update`.
* A few tests for the issue #7113 are written, other lwt-batch-related
tests adjusted accordingly.
Previously batch statement result set included rows for only
those updates which have a prefetch data present (i.e. there
was an "old" (pre-existing) row for a key).
Also, these rows were sorted not in the order in which statements
appear in the batch, but in the order of updated clustering keys.
If we have a batch which updates a few non-existent keys, then
it's impossible to figure out which update inserted a new key
by looking at the query response. Not only because the responses
may not correspond to the order of statements in the batch, but
even some rows may not show up in the result set at all.
The patch proposes the following fix:
For conditional batch statements the result set now always
includes a row for each LWT statement, in the same order
in which individual statements appear in the batch.
This way we can always tell which update did actually insert
a new key or update the existing one.
`update_parameters::prefetch_data::row::is_in_cas_result_set`
member variable was removed as well as supporting code in
`cas_request::applies_to` which iterated through cas updates
and marked individual `prefetch_data` rows as "need to be in
cas result set".
Instead, `cas_request::applies_to` is now significantly
simplified since it doesn't do anything more than checking
`stmt.applies_to()` in a short-circuiting manner.
A few tests for the issue are written, other lwt-batch-related
tests were adjusted accordingly to include rows in result set
for each statement inside conditional batches.
Tests: unit(dev, debug)
Co-authored-by: Konstantin Osipov <kostja@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is just a plain move of the code from `modification_statement`
to `cas_request` without changes in the logic, which will further
help to refactor `build_cas_result_set` behavior to include a row
for each LWT statement and order rows in the order of statements
in a batch.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Factor out a little helper function which finds a pre-existing
row for a given `cas_row_update` (matching the primary key).
Used in `cas_request::applies_to`.
Will be used in a subsequent patch to move
`modification_statement::build_cas_result_set` into `cas_request`.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The partitions_type::lower_bound() method can return a hint that saves
info about the "lower-ness of the bound"; in particular, when the search
key is found, this can be deduced from the hint without a comparison.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::find_or_create is only used to put (or touch) an entry in cache
while having the partition_start mutation at hand. Thus, there's no point in
carrying the key reference and tombstone value through the calls; the
partition_start reference is enough.
Since the new cache entry is created incomplete, rename the creation method
to reflect this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only caller of find_or_create() in tests works on an already existing
(.populate()-d) entry, so patch this place for explicitness and for the sake
of subsequent patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the key for a new partition is copied inside do_find_or_create_entry,
we may call this function without an allocator set, as it sets the allocator
internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the missing partition is created in cache, the decorated key is copied
from the ring position view too early -- for the lookup. However, the read
context has already entered the partition and already has the decorated key
on board, so for the lookup we can use the reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We use a lot of TLS variables yet GDB is not of much help when working
with these. So in this patch I document where they are located in
memory, how to calculate the address of a known TLS variable and how to
find (identify) one given an address.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200903131802.1068288-1-bdenes@scylladb.com>
* seastar 7f7cf0f232...4ff91c4c3a (47):
> core/reactor: complete_timers(): restore previous scheduling group
Fixes #7117.
> future: Drop spurious 'mutable' from a lambda
> future: Don't put a std::tuple in future_state
> future: Prepare for changing the future_state storage type
> future: Add future_state::get0()
> when_all: Replace another untuple with get0
> when_all: Use get0 instead of untuple
> rpc: Don't assume that a future stores a std::tuple
> future: Add a futurize_base
> testing: Stop boost from installing its own signal handler
> future: Define futurize::value_type with future::value_type
> future: Move futurize down the file
> Merge "Put logging onto {fmt} rails" from Pavel E
> Merge "Make future_state non variadic" from Rafael
> sharded: propagate sharded instances stop error
> log: Fix misprint in docs
> future: Use destroy_at in destructor
> future: Add static_assert rejecting variadic future_state
> futures_test: Drop variadic test
> when_all: Drop a superfluous use of futurize::from_tuples
> everywhere: Use future::get0 when appropriate
> net: upgrade sockopt comments to doxygen
> iostream: document the read_exactly() function
> net:tls: fix clang-10 compilation duration cast
> future: Move continuation_base_from_future to future.hh
> repeat: Drop unnecessary make_tuple
> shared_future: Don't use future_state_type
> future: Add a static_assert against variadic futures
> future: Delete warn_variadic_future
> rpc_demo: Don't use a variadic future
> rpc_test: Don't use a variadic future
> futures_test: Don't use a variadic future
> future: Move disable_failure_guard to promise::schedule
> net: add an interface for custom socket options
> posix: add one more setsockopt overload
> Merge "Simplify pollfn and its inheritants" from Pavel E
> util:std-compat.hh: add forward declaration of std::pmr for clang-10
> rpc: Add protocol::has_handlers() helper
> Add a Seastar_DEBUG_ALLOCATIONS build option
> futures_test: Add a test for futures of references
> future: Simplify destruction of future_state
> Use detect_stack_use_after_return=1
> repeat: Fix indentation
> repeat: Delete try/catch
> repeat: Simplify loop
> Avoid call to std::exception_ptr's destructor from a .hh
> file: Add missing include
"
Before seastar is updated with the {fmt} engine under the
logging hood, some changes are to be made in scylla to
conform to {fmt} standards.
Compilation and tests checked against both the old (current)
and the new seastar.
tests: unit(dev), manual
"
* 'br-logging-update' of https://github.com/xemul/scylla:
code: Force formatting of pointer in .debug and .trace
code: Format { and } as {fmt} needs
streaming: Do not reveal raw pointer in info message
mp_row_consumer: Provide hex-formatting wrapper for bytes_view
heat_load_balance: Include fmt/ranges.h
This patch adds a test for the TRIM_HORIZON option of GetShardIterator in
Alternator Streams. This option asks to fetch again *all* the available
history in this shard stream. We had an implementation for it, but not a
test - so this patch adds one. The test passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830131458.381350-1-nyh@scylladb.com>
Alternator Streams already support the AT_SEQUENCE_NUMBER and
AFTER_SEQUENCE_NUMBER options for iterators. These options allow replaying
a stream of changes from a known position or after that known position.
However, we never had a test verifying that these features actually work
as intended, beyond just checking syntax. Having such tests is important
because recently we changed the implementation of these iterators, but
didn't have a test verifying that they still work.
So in this patch we add such tests. The tests pass (as usual, on both
Alternator and DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830115817.380075-1-nyh@scylladb.com>
We had a test, test_streams_last_result, that verifies that after reading
the last event from an Alternator Stream, reading again will find nothing.
But we didn't actually have a test which checks that if at that point a new event
*does* arrive, we can read it. This test checks this case, and it passes (we don't
have a bug there, but it's good as a regression test for NextShardIterator).
This test also verifies that after reading an event for a particular key on
a specific stream "shard", the next event for the same key will arrive on the
same shard.
This test passes on both Alternator and DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830105744.378790-1-nyh@scylladb.com>
A compaction strategy that supports parallel compaction may want to know
if the table has a compaction running on its behalf before making a decision.
For example, a size-tiered-like strategy may not want to trigger a behavior,
like cross-tier compaction, when there's ongoing compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200901134306.23961-1-raphaelsc@scylladb.com>
If available, use the recently added
`reader_concurrency_semaphore::_initial_resources` to calculate the
amount of memory used out of the initially configured amount. If not
available, the summary falls back to the previous mode of just printing
the remaining amount of memory.
Example:
Replica:
Read Concurrency Semaphores:
user sstable reads: 11/100, 263621214/ 42949672 B, queued: 847
streaming sstable reads: 0/ 10, 0/ 42949672 B, queued: 0
system sstable reads: 1/ 10, 251584/ 42949672 B, queued: 0
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200901091452.806419-1-bdenes@scylladb.com>
This patch fixes a bug which caused sporadic failures of the Alternator
test - test_streams.py::test_streams_last_result.
The GetRecords operation reads from an Alternator Streams shard and then
returns an "iterator" from where to continue reading next time. Because
we obviously don't want to read the same change again, we "incremented"
the current position, to start at the incremented position on the next read.
Unfortunately, the implementation of the increment() function wasn't quite
right. The position in the CDC log is a timeuuid, which has a really bizarre
comparison function (see compare_visitor in types.cc). In particular the
least-significant bytes of the UUID are compared as *signed* bytes. This
means that if the last byte of the UUID was 127, and increment() increased
it to 128, this was wrong because the comparison function later deemed
that, as a signed byte, 128 is lower than 127, not higher! The result
was that with 1/256 probability (whenever the last byte of the position was
127) we would return an item twice. This was reproduced (with 1/256
probability) by the test test_streams_last_result, as reported in issue #7004.
The fix in this patch is to drop the increment() and replace it by a flag
whether an iterator is inclusive of the threshold (>=) or exclusive (>).
The internal representation of the iterator has a boolean flag "inclusive",
and the string representation uses the prefixes "I" or "i" to indicate an
inclusive or exclusive range, respectively - whereas before this patch we
always used the prefix "I".
Although increment() could have been fixed to work correctly, the result would
have been ugly because of the weirdness of the timeuuid comparison function.
increment() would also require extensive new unit-tests: we were lucky that
the high-level functional tests caught a 1 in 256 error, but they would not
have caught rarer errors (e.g., 1 in 2^32). Furthermore, I am looking at
Alternator as the first "user" of CDC, and seeing how complicated and
error-prone increment() is, we should not recommend that users use this
technique - they should use exclusive (>) range queries instead.
Fixes #7004.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200901102718.435227-1-nyh@scylladb.com>
Users can set python3 and sysconfdir from the cmdline of install.sh
according to the install mode (root or nonroot) and distro type.
It's helpful to correct the default python3/sysconfdir; otherwise
the setup scripts or scylla-jmx don't work.
Fixes #7130
Signed-off-by: Amos Kong <amos@scylladb.com>
If install.sh is executed via sudo, the tmp scylla.yaml is owned
by root, making it difficult for non-privileged users to overwrite.
Signed-off-by: Amos Kong <amos@scylladb.com>
Commit a6ad70d3da changed the format of
stream IDs: the lower 8 bytes were previously generated randomly, now
some of them have semantics. In particular, the least significant byte
contains a version (stream IDs might evolve with further releases).
This is a backward-incompatible change: the code won't properly handle
stream IDs with all lower 8 bytes generated randomly. To protect us from
subtle bugs, the code has an assertion that checks the stream ID's
version.
This means that if an experimental user used CDC before the change and
then upgraded, they might hit the assertion when a node attempts to
retrieve a CDC generation with old stream IDs from the CDC description
tables and then decode it.
In effect, the user won't even be able to start a node.
Similarly as with the case described in
d89b7a0548, the simplest fix is to rename
the tables. This fix must get merged in before CDC goes out of
experimental.
Now, if the user upgrades their cluster from a pre-rename version, the
node will simply complain that it can't obtain the CDC generation
instead of preventing the cluster from working. The user will be able to
use CDC after running checkAndRepairCDCStreams.
Since a new table is added to the system_distributed keyspace, the
cluster's schema has changed, so sstables and digests need to be
regenerated for schema_digest_test.
Merged pull request https://github.com/scylladb/scylla/pull/7121
By Calle Wilund:
Refs #7095
Fixes #7128
The CDC delta != full modes both relied on post-filtering
to remove the generated log row and/or cells. This is inefficient.
Instead, simply check if the data should be created in the
visitors.
Also removed the delta_mode=off mode.
cdc: Remove post-filterings for keys-only/off cdc delta generation
cdc: Remove cdc delta_mode::off
Refs #7095
The CDC delta != full modes both relied on post-filtering
to remove the generated log row and/or cells. This is inefficient.
Instead, simply check if the data should be created in the
visitors.
v2:
* Fixed delta logs rows created (empty) even when delta == off
v3:
* Killed delta == off
v4:
* Move checks into (const) member var(s)
Fixes #7128
CDC logs are not useful without at least delta_mode==keys, since
pre/post image data has no info on _what_ was actually done to the
base table in the source mutation.
The following metric is added:
scylla_node_maintenance_operations_removenode_finished_percentage{shard="0",type="gauge"} 0.650000
It reports the percentage of the removenode operation finished so far.
Fixes #1244, #6733
The following metric is added:
scylla_node_maintenance_operations_decommission_finished_percentage{shard="0",type="gauge"}
0.650000
It reports the percentage of the decommission operation finished so far.
Fixes #1244, #6733
The following metric is added:
scylla_node_maintenance_operations_replace_finished_percentage{shard="0",type="gauge"} 0.650000
It reports the percentage of the replace operation finished so far.
Fixes #1244, #6733
The hget and hset commands use hashes internally, thus they are not
using the existing write_strings() function.
Limitations:
- hset only supports 3 params, instead of the multiple field/value
list that is available in the official redis-server.
- hset should return 0 when the key and field already exists,
but I am not sure it's possible to retrieve this information
without doing read-before-write, which would not be atomic.
I factored out the query_* functions a bit to reduce duplication, but
I am not 100% sure of the naming; it may still be a bit confusing
between the schema used (strings, hashes) and the returned format
(currently only string, but array should come later with hgetall).
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200830190128.18534-1-etienne.adam@gmail.com>
Try to look up and use the schema from the local schema_registry
when we have a schema mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation` for
the accepted proposal.
When such a situation happens, the stored `frozen_mutation` can
be applied only if we are lucky enough and the column_mapping in
the mutation is "compatible" with the new table schema.
It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.
With the patch we are able to mitigate these cases as long as the
referenced schema is still present in the node cache (e.g. the node
didn't restart/crash and the cache entry is not too old to be
evicted).
Tests: unit(dev, debug), dtest(paxos_tests.schema_mismatch_*_test)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200827150844.624017-1-pa.solodovnikov@scylladb.com>
The get_natural_endpoints method returns the list of nodes holding a key.
There is a variation of the method that gets the key as a string; the
current implementation just casts the string to bytes_view, which does
not work. Instead, this patch changes the implementation to use
from_nodetool_style_string to translate the key (in a nodetool-like
format) to a token.
Fixes #7134
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
After issue #7107 was fixed (regarding the correctness of OldImage and NewImage
in Alternator Streams) we forgot to remove the "xfail" tag from one of the tests
for this issue.
This test now passes, as expected, so in this patch we remove the xfail tag.
Refs #7107
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200827103054.186555-1-nyh@scylladb.com>
alternator/getting-started.md had a missing grave accent (`) character,
resulting in messed up rendering of the involved paragraph. Add the missing
quote.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200827110920.187328-1-nyh@scylladb.com>
The lockup:
When view_builder starts, all shards at some point reach a
barrier, waiting for each other to pass. If any shard misses
this checkpoint, all others get stuck forever. As this barrier
lives inside the _started future, which in turn is waited on
during stop, the stop gets stuck as well.
Reasons to miss the barrier -- an exception in the middle of the
fun^w start, or an explicit abort request while waiting for the
schema agreement.
Fix the "exception" case by unlocking the barrier promise with
an exception, and fix the "abort request" case by turning it into
an exception.
The bug can be reproduced by hand by making one shard never
see the schema agreement and keep looping until the abort
request arrives.
The crash:
If the background start up fails, then the _started future is
resolved into exception. The view_builder::stop then turns this
future into a real exception caught-and-rethrown by main.cc.
This seems wrong that a failure in a background fiber aborts
the regular shutdown that may proceed otherwise.
tests: unit(dev), manual start-stop
branch: https://github.com/xemul/scylla/tree/br-view-builder-shutdown-fix-3
fixes: #7077
Patch #5 leaves the seastar::async() in the 1-st phase of the
start() although can also be tuned not to produce a thread.
However, there's one more (painless) issue with the _sem usage,
so this change appears too large for the part of the bug-fix
and will come as a followup.
* 'br-view-builder-shutdown-fix-3' of git://github.com/xemul/scylla:
view_builder: Add comment about builder instances life-times
view_builder: Do sleep abortable
view_builder: Wakeup barrier on exception
view_builder: Always resolve started future to success
view_builder: Re-futurize start
view_builder: Split calculate_shard_build_step into two
view_builder: Populate the view_builder_init_state
view_builder: Fix indentation after previous patch
view_builder: Introduce view_builder_init_state
Merged pull request https://github.com/scylladb/scylla/pull/7063
By Calle Wilund:
Fixes#6935
DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords. But only if there is other data
to include.
This patch appends the pk/ck parts into old/new image
iff we had any record data.
Updated to handle keys-only updates, and distinguish creating vs. updating rows. Changes cdc to not generate preimage
for non-existent/deleted rows, and also fixes missing operations/ttls in keys-only delta mode.
alternator_streams: Include keys in OldImage/NewImage
cdc: Do not generate pre/post image for non-existent rows
Fixes #6935
Fixes #7107
DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords.
This patch appends the pk/ck parts into old/new image, and
also removes the previous restrictions on image generation
since cdc now generates more consistent pre/post image
data.
Fixes #7119
Fixes #7120
If the preimage select came up empty - i.e. the row did not exist, either
because it was never created or was since deleted - we should not bother
creating a log preimage row for it. Esp. since it makes it harder to
interpret the cdc log.
If an operation in a cdc batch did a row delete (ranged, ck, etc), do
not generate postimage data, since the row does no longer exist.
Note that we differentiate deleting all (non-pk/ck) columns from actual
row delete.
... and tests. Printing a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that want to print "{<text>}" strings, but do not format
the curly braces the {fmt}-way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Showing raw pointer values in logs is not considered to be good
practice. However, for debugging/tracing this might be helpful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By default {fmt} doesn't know how to format this type (although it's an
instantiation of basic_string_view), and even providing a formatter/operator<<
does not help -- it still hits an earlier assertion in the args mapper about
the disallowed mixing of character types.
A hex-wrapper with its own operator<< solves the problem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.
Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may no longer match the current base table schema, which is used
to generate the view updates.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Fixes#7061.
Tests:
- unit (dev)
- [v1] manual (reproduced using scylla binary and cqlsh)
"
* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
db: view: Refactor view_info::initialize_base_dependent_fields()
tests: mv: Test dropping columns from base table
db: view: Fix incorrect schema access during view building after base table schema changes
schema: Call on_internal_error() when out of range id is passed to column_at()
db: views: Fix undefined behavior on base table schema changes
db: views: Introduce has_base_non_pk_columns_in_view_pk()
If one shard delays in seeing the schema agreement and returns on
abort request, other shards may get stuck waiting for it on the
status read barrier. Luckily with the previous patch the barrier
is exception-proof, so we may abort the waiting loop with an exception
and handle the lock-up.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If an exception pops up during the view_builder::start while some
shards wait for the status-read barrier, these shards are not woken
up, thus causing the shutdown to get stuck.
Fix this by setting exception on the barrier promise, resolving all
pending and on-going futures.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
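The fix pattern - resolving the barrier with an exception so every waiter wakes up - can be sketched without seastar as follows; `barrier_result` and `wait_on_barrier` are illustrative names, not the view_builder's real types:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Illustrative: the barrier's outcome is either success or a stored exception.
struct barrier_result {
    std::exception_ptr error; // null means the barrier was passed normally
};

// A waiting shard either passes or observes the failure - it never blocks
// forever, because the failing shard stored the exception instead of
// silently abandoning the barrier.
std::string wait_on_barrier(const barrier_result& b) {
    if (!b.error) {
        return "passed";
    }
    try {
        std::rethrow_exception(b.error);
    } catch (const std::exception& e) {
        return e.what();      // waiter wakes up with the original failure
    } catch (...) {
    }
    return "unknown failure";
}
```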
If the view builder background start fails, the _started future resolves
to exceptional state. In turn, stopping the view builder keeps this state
through .finally() and aborts the shutdown very early, while it may and
should proceed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Step two turning the view_builder::start() into a chain of lambdas --
rewrite (most of) the seastar::async()'s lambda into a more "classical"
form.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The calculate_shard_build_step() has a cross-shard barrier in the middle and
passing the barrier is broken wrt exceptions that may happen before it. The
intention is to prepare this barrier passing for exception handling by turning
the view_builder::start() into a dedicated continuation lambda.
Step one in this campaign -- split the calculate_shard_build_step() into
steps called by view_builder::start():
- before the barrier
- barrier
- after the barrier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keep the internal calculate_shard_build_step()'s stuff on the init helper
struct, as the method in question is about to be split into a chain of
continuation lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the helper initialization struct that will carry the needed
objects across continuation lambdas.
The indentation in ::start() will be fixed in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This series adds scylla repairs command to help debug repair.
Fixes#7103
"
* asias-repair_help_debug_scylla_repairs_cmd:
scylla-gdb.py: Add scylla repairs command
repair: Add repair_state to track repair states
scylla-gdb.py: Print the pointers of elements in boost_intrusive_list_printer
scylla-gdb.py: Add printer for gms::inet_address
scylla-gdb.py: Fix a typo in boost_intrusive_list
repair: Fix the incorrect comments for _all_nodes
repair: Add row_level_repair object pointer in repair_meta
repair: Add counter for reads issued and finished for repair_reader
"
Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problem, the storage for temporary values is moved from an external data structure to the value views themselves. This way, a temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream.
Tests: unit(dev)
Fixes https://github.com/scylladb/scylla/issues/7019
"
* psarna-move_temporaries_to_value_view:
cql3: remove query_options::linearize and _temporaries
cql3: remove make_temporary helper function
cql3: store temporaries in-place instead of in query_options
cql3: add temporary_value to value view
cql3: allow moving data out of raw_value
cql3: split values.hh into a .cc file
With relatively big summaries, the reactor can be stalled for a couple
of milliseconds.
This patch:
a. allocates positions upfront to avoid excessive reallocation.
b. returns a future from seal_summary() and uses `seastar::do_for_each`
to iterate over the summary entries so the loop can yield if necessary.
Fixes#7108.
Based on 2470aad5a389dfd32621737d2c17c7e319437692 by Raphael S. Carvalho <raphaelsc@scylladb.com>
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200826091337.28530-1-bhalevy@scylladb.com>
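Fix (a) can be sketched generically; `build_positions` below is a hypothetical stand-in for the summary-sealing loop (the yielding part (b) is omitted since it needs seastar's `do_for_each`):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reserving the positions vector up front replaces O(log n) reallocations
// (each copying the whole vector) with a single allocation.
std::vector<uint64_t> build_positions(std::size_t n_entries) {
    std::vector<uint64_t> positions;
    positions.reserve(n_entries);       // fix (a): allocate upfront
    for (std::size_t i = 0; i < n_entries; ++i) {
        positions.push_back(i * 128);   // illustrative entry offsets
    }
    return positions;
}
```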
Per scheduling-group data was moved from the task queues to a separate
data member in the reactor itself. Update `scylla memory` to use the new
location to get the per scheduling-group data for the storage proxy
stats.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200825140256.309299-1-bdenes@scylladb.com>
We copy a list, which was reported to generate a 15ms stall.
This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.
Fixes#7115.
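The class of fix can be illustrated generically (the types below are made up): moving a `std::list` on its last use is O(1), while copying it walks and duplicates every node:

```cpp
#include <list>
#include <string>
#include <utility>

// Illustrative holder of the result list.
struct holder {
    std::list<std::string> items;
};

// Copies every node - with a large list this is the kind of work that
// caused the reported 15ms stall.
holder make_result_copy(const std::list<std::string>& src) {
    return holder{src};
}

// O(1): steals the nodes. Safe only when this is the source's last use.
holder make_result_move(std::list<std::string>&& src) {
    return holder{std::move(src)};
}
```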
If file_writer::close() fails to close the output stream
closing will be retried in file_writer::~file_writer,
leading to:
```
include/seastar/core/future.hh:1892: seastar::future<T ...> seastar::promise<T>::get_future() [with T = {}]: Assertion `!this->_future && this->_state && !this->_task' failed.
```
as seen in https://github.com/scylladb/scylla/issues/7085
Fixes #7085
Test: unit(dev), database_test with injected error in posix_file_impl::close()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200826062456.661708-1-bhalevy@scylladb.com>
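The fix pattern - never retrying a close from the destructor - can be sketched as follows (illustrative, not the real `file_writer`):

```cpp
// close() flips _closed before doing the work, so a failed or completed
// close is never re-attempted from the destructor (which previously re-ran
// close() and tripped the promise assertion shown above).
class file_writer {
    bool _closed = false;
public:
    int close_calls = 0;     // counts how many times the close work runs
    void close() {
        if (_closed) {
            return;          // already attempted: never retry
        }
        _closed = true;      // mark first, so even a throwing close counts
        ++close_calls;       // ... flush and close the output stream here ...
    }
    ~file_writer() {
        if (!_closed) {
            close();         // best-effort close for the non-closed case only
        }
    }
};
```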
This patch adds tests for the "NewImage" attribute in Alternator Streams
in NEW_IMAGE and NEW_AND_OLD_IMAGES mode.
It reproduces issue #7107, that items' key attributes are missing in the
NewImage. It also verifies the risky corner cases where the new item is
"empty" and NewImage should include just the key, vs. the case where the
item is deleted, so NewImage should be missing.
This test currently passes on AWS DynamoDB, and xfails on Alternator.
Refs #7107.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200825113857.106489-1-nyh@scylladb.com>
Never trust Occam's Razor - it turns out that the use-after-free bug in the
"exists" command was caused by two separate bugs. We fixed one in commit
9636a33993, but there is a second one fixed in
this patch.
The problem fixed here was that a "service_permit" object, which is designed to
be copied around from place to place (it contains a shared pointer, so is cheap
to copy), was saved by reference, and the reference was to a function argument
and was destroyed prematurely.
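A minimal sketch of this bug class and its fix, with made-up types: because the permit wraps a shared pointer, storing it by value is cheap and keeps the underlying units alive, whereas a stored reference to a function argument dangles once the call returns:

```cpp
#include <memory>
#include <utility>

// Illustrative stand-in: cheap to copy because it holds a shared pointer.
struct service_permit {
    std::shared_ptr<int> units = std::make_shared<int>(0);
};

// The long-lived object stores the permit BY VALUE (the fix). Storing
// `const service_permit&` to a function argument was the bug: the argument
// is destroyed when the call returns, leaving a dangling reference.
struct command {
    service_permit permit;
    explicit command(service_permit p) : permit(std::move(p)) {}
};
```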
This time I tested *many times* that that test_strings.py passes on both dev and
debug builds.
Note that test/run/redis still fails in a debug build, but due to a different
problem.
Fixes#6469
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200825183313.120331-1-nyh@scylladb.com>
test/redis/README.md suggests that when running "pytest" the default is to connect
to a local redis on localhost:6379. This default was recently lost when options
were added to use a different host and port. It's still good to have the default
suggested in README.md.
It also makes it easier to run the tests against the standard redis, which by
default runs on localhost:6379 - by just running "pytest".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200825195143.124429-1-nyh@scylladb.com>
query_options::linearize was the only user of the _temporaries helper
attribute, and it turns out that this function is never used -
and is therefore removed.
When a value_view needs to store a temporarily instantiated object,
it can use the new variant field. The temporary value will live
only as long as the view itself.
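The idea can be sketched with `std::variant` (the types below are illustrative, not Scylla's actual cql3 classes): the view either borrows external bytes or owns the temporary, so the temporary's lifetime is exactly the view's:

```cpp
#include <string>
#include <string_view>
#include <variant>

// A value view that either references external storage (string_view) or
// owns a temporary (string). The owned alternative is destroyed together
// with the view - no external temporaries map needed.
class value_view {
    std::variant<std::string_view, std::string> _data;
public:
    explicit value_view(std::string_view v) : _data(v) {}            // borrowed
    explicit value_view(std::string tmp) : _data(std::move(tmp)) {}  // owned temporary
    std::string_view bytes() const {
        return std::visit([](const auto& d) { return std::string_view(d); }, _data);
    }
    bool owns_storage() const { return _data.index() == 1; }
};
```

The trade-off noted above applies here too: each owned temporary is a separate allocation instead of a slice of one shared byte stream.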
Use repair_state to track the major state of repair from the beginning
to the end of repair.
With this patch, we can easily know at which state both the repair
master and followers are. It is very helpful when debugging a repair
hang issue.
Refs #7103
We saw errors in killed_wiped_node_cannot_join_test dtest:
Aug 2020 10:30:43 [node4] Missing: ['A node with address 127.0.76.4 already exists, cancelling join']:
The test does:
n1, n2, n3, n4
wipe data on n4
start n4 again with the same ip address
Without this patch, n4 will bootstrap into the cluster with new tokens.
We should prevent n4 from bootstrapping because a node with the same
address already exists in the cluster.
In shadow round, the local node should apply the application state of
the node with the same ip address. This is useful to detect a node
trying to bootstrap with the same IP address as an existing node.
Tests: bootstrap_test.py
Fixes: #7073
Fixes #6900
Clustered range deletes did not clear out the "row_states" data associated
with affected rows (might be many).
Adds a sweep through and erases relevant data. Since we do pre- and
postimage in "order", this should only affect postimage.
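The sweep can be sketched on an ordered map keyed by clustering position (illustrative types; the real row_states structure differs):

```cpp
#include <map>
#include <string>

// On a clustered range delete, erase the cached per-row state for every
// row covered by the range [ck_lo, ck_hi], so no stale state leaks into
// the postimage.
void clear_row_states(std::map<int, std::string>& row_states, int ck_lo, int ck_hi) {
    row_states.erase(row_states.lower_bound(ck_lo), row_states.upper_bound(ck_hi));
}
```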
Fix the default number of test repeats to 1, which is what it was
before (spotted by Nadav). Also, prefix the options so that they become
"--test-repeat" and "--test-timeout" (spotted by Avi).
Message-Id: <20200825081456.197210-1-penberg@scylladb.com>
It is useful to distinguish if the repair is a regular repair or used
for node operations.
In addition, log the keyspace and tables being repaired.
Fixes#7086
A check, to validate that a counter column cannot be added to a non-counter table,
is missing for the alter table statement. Validation is performed when building the new
schema, but it's limited to checking that a schema will not contain both counter
and non-counter columns.
Due to the lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.
The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that a counter column can be added to a
non-counter table.
Fixes#7065.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
To verify an offline installed image with do_verify_package() in scylla_setup, we introduce an is_offline() function that works just like is_nonroot(), but detects offline installs rather than --nonroot ones.
* syuu1228-do_verify_package_for_offline:
scylla_setup: verify package correctly on offline install
scylla_util.py: implement is_offline() to detect offline installed image
do_verify_package was written only for .rpm/.deb, and does not work correctly
for offline installs (including nonroot).
We should check file existence for the environment, not package existence
using rpm/dpkg.
Fixes#7075
A missing "&" caused the key stored in a long-living command to be copied
and the copy quickly freed - and then used after being freed.
This caused the test test_strings.py::test_exists_multiple_existent_key for
this feature to frequently crash.
Fixes#6469
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200823190141.88816-1-nyh@scylladb.com>
Consider:
1) Start n1,n2,n3
2) Stop n3
3) Start n4 to replace n3 but list n4 as seed node
4) Node n4 finishes replacing operation
5) Restart n2
6) Run SELECT * from system.peers on node n4 or node n1.
cqlsh> SELECT * from system.peers ;
peer      | data_center | host_id | preferred_ip | rack | release_version | rpc_address | schema_version | supported_features | tokens
127.0.0.3 | null        | null    | null         | null | null            | null        | null           | null               | {'-90410082611643223', '5874059110445936121'}
The replaced old node 127.0.0.3 shows in system.peers.
(Note, since commit 399d79fc6f (init: do not allow
replace-address for seeds), step 3 will be rejected. Assume we use a version without it)
The problem is that n2 sees n3 is in gossip status of SHUTDOWN after
restart. The storage_service::handle_state_normal callback is called for
127.0.0.3. Since n4 is using different tokens than n3 (a seed node does not
bootstrap, so it uses new tokens instead of the tokens of n3 which is being
replaced), owned_tokens will be set. We see logs like:
[shard 0] storage_service - handle_state_normal: New node 127.0.0.3 at token 5874059110445936121
[shard 0] storage_service - Host ID collision for cbec60e5-4060-428e-8d40-9db154572df7 between 127.0.0.4
and 127.0.0.3; ignored 127.0.0.3
As a result, db::system_keyspace::update_tokens
will wrongly be called to write 127.0.0.3 to system.peers.
if (!owned_tokens.empty()) {
db::system_keyspace::update_tokens(endpoint, owned_tokens)
}
To fix, we should skip calling db::system_keyspace::update_tokens if the
node is present in endpoints_to_remove.
Refs: #4652
Refs: #6397
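The fix can be sketched as a guard in front of the persist call (names and types below are illustrative, not Scylla's real signatures):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using endpoint = std::string;

// Only persist owned tokens for endpoints that handle_state_normal has NOT
// already queued for removal - otherwise a replaced node would be
// resurrected in system.peers.
void update_peer_tokens(std::map<endpoint, std::vector<long>>& peers_table,
                        const endpoint& ep,
                        const std::vector<long>& owned_tokens,
                        const std::set<endpoint>& endpoints_to_remove) {
    if (owned_tokens.empty() || endpoints_to_remove.count(ep)) {
        return; // skip: nothing to write, or node is being removed
    }
    peers_table[ep] = owned_tokens;
}
```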
Just like test/alternator, make redis-test runnable from test.py.
For this we move the redis tests into a subdirectory of tests/,
and create a script to run them: tests/redis/run.
These tests currently fail, so we did not yet modify test.py to actually
run them automatically.
Fixes #6331
"
There's one last call to the global storage service left in compaction code; it
comes from cleanup_compaction to get local token ranges for filtering.
The call in question is a pure wrapper over database, so this set just
makes use of the database where it's already available (perform_cleanup)
and adds it where it's needed (perform_sstable_upgrade).
tests: unit(dev), nodetool upgradesstables
"
* 'br-remove-ss-from-compaction-3' of https://github.com/xemul/scylla:
storage_service: Remove get_local_ranges helper
compaction: Use database from options to get local ranges
compaction: Keep database reference on upgrade options
compaction: Keep database reference on cleanup options
db: Factor out get_local_ranges helper
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.
Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as hints manager waits until all ongoing
hint sending operation finish before stopping itself.
Fixes: #7051
This reader was probably created in ancient times, when readers didn't
yet have a _schema member of their own. But now that they do, it is not
necessary to store the schema in the reader implementation, there is one
available in the parent class.
While at it also move the schema into the class when calling the
constructor.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200821070358.33937-1-bdenes@scylladb.com>
"
We keep references to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.
Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls. To make that possible we need to make
sure the token_metadata object they are traversing won't change
mid-loop.
This series is a first step in serializing updates to the
shared token metadata with respect to reads of it.
Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"
* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
api/http_context: keep a const sharded<locator::token_metadata>&
gossiper: keep a const token_metadata&
storage_service: separate get_mutable_token_metadata
range_streamer: keep a const token_metadata&
storage_proxy: delete unused get_restricted_ranges declaration
storage_proxy: keep a const token_metadata&
storage_proxy: get rid of mutable get_token_metadata getter
database: keep const token_metadata&
database: keyspace_metadata: pass const locator::token_metadata& around
everywhere_replication_strategy: move methods out of line
replication_strategy: keep a const token_metadata&
abstract_replication_strategy: get_ranges: accept const token_metadata&
token_metadata: rename calculate_pending_ranges to update_pending_ranges
token_metadata: mark const methods
token_ranges: pending_endpoints_for: return empty vector if keyspace not found
token_ranges: get_pending_ranges: return empty vector if keyspace not found
token_ranges: get rid of unused get_pending_ranges variant
replication_strategy: calculate_natural_endpoints: make token_metadata& param const
token_metadata: add get_datacenter_racks() const variant
The cleanup compaction wants to keep local tokens on-board and gets
them from storage_service.get_local_ranges().
This method is the wrapper around database.get_keyspace_local_ranges()
created in previous patch, the live database reference is already
available on the descriptor's options, so we can short-cut the call.
This allows removing the last explicit call for global storage_service
instance from compaction code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only place that creates them is the API upgrade_sstables call.
The created options object doesn't outlive the returned
future, so it's safe to keep this reference there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database is available at both places that create the options --
tests and API perform_cleanup call.
The options object doesn't outlive the returned future, so it's
safe to keep the reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service and repair code have identical helpers to get local
ranges for keyspace. Move this helper's code onto database, later it
will be reused by one more place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use a different getter for a token_metadata& that
may be changed so we can better synchronize readers
and writers of token_metadata and eventually allow
them to yield in asynchronous loops.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We'd like to strictly control who can modify token metadata
and nobody currently needs a mutable reference to storage_proxy::_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to modify token_metadata from database code.
Also, get rid of mutable get_token_metadata variant.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move methods depending on token_metadata to the source file
so we can avoid including token_metadata.hh in header files
where possible.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many token_metadata methods do not modify the object
and can be marked as const.
The motivation is to better control who may modify
token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
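A minimal illustration of the const-marking step (a made-up, much-reduced `token_metadata`): once read-only accessors are `const`, readers can be handed a `const token_metadata&`, while mutation requires the mutable getter:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class token_metadata {
    std::vector<int64_t> _tokens;
public:
    // Mutator: only reachable through a mutable reference.
    void update_tokens(std::vector<int64_t> t) { _tokens = std::move(t); }
    // Read-only accessor, now marked const: callable via const token_metadata&.
    std::size_t token_count() const { return _tokens.size(); }
};
```

This is what lets the rest of the series hand out `const token_metadata&` everywhere and funnel all writes through a single mutable getter.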
It is no longer called only once for a given view_info, so the name
"initialize" is not appropriate.
This patch splits the "initialize" method into the "make" part, which
makes a new base_info object, and the "set" part, which changes the
current base_info object attached to the view.
The view building process was accessing mutation fragments using
current table's schema. This is not correct, fragments must be
accessed using the schema of the generating reader.
This could lead to undefined behavior when the column set of the base
table changes. out_of_range exceptions could be observed, or data in
the view ending up in the wrong column.
Refs #7061.
The fix has two parts. First, we always use the reader's schema to
access fragments generated by the reader.
Second, when calling populate_views() we upgrade the fragment-wrapping
reader's schema to the base table schema so that it matches the base
table schema of view_and_base snapshots passed to populate_views().
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema does not change
when the base table is altered.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Refs #7061.
When the user defines a build mode with configure.py, the build, check, and
test targets fail as follows:
./configure.py --mode=dev && ninja build
ninja: error: 'debug-build', needed by 'build', missing and no known rule to make it
Fix the issue by making the targets depend on build targets for
specified build modes, not all available modes.
Message-Id: <20200813105639.1641090-1-penberg@scylladb.com>
We already have a test, test_streams.py::test_streams_updateitem_old_image,
for issue #6935: It tests that the OLD_IMAGE in Alternator Streams should
contain the item's key.
However, this test was missing one corner case, which the first solution
for this issue got wrong. So in this patch we add a test for this
corner case, test_streams_updateitem_old_image_empty_item:
This corner case is about the item existing, but being *empty*, i.e., having just
the key and no other attributes. In this case, OLD_IMAGE should return that
empty item - including its key. Not nothing.
As usual, this test passes on DynamoDB and xfails on Alternator, and the
"xfail" mark will be removed when issue #6935 is fixed.
Refs #6935.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200819155229.34475-1-nyh@scylladb.com>
When binding a prepared statement it is possible that the values being bound
are not correct. Unfortunately, before this patch the error message
only said what type got a wrong value. This was not very helpful
because there could be multiple columns with the same type in the table.
We also support collections, so sometimes the error said there
was a wrong value for a type T while the affected column was actually of
type collection<T>.
This patch adds information about a column name that got the wrong
value so that it's easier to find and fix the problem.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <90b70a7e5144d7cb3e53876271187e43fd8eee25.1597832048.git.piotr@scylladb.com>
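The improvement can be sketched as follows; the message format below is illustrative, not Scylla's exact wording:

```cpp
#include <string>

// Include the column name alongside the type in the bind-error message,
// so the failing column can be identified even when several columns share
// a type, or when the column is actually collection<T>.
std::string bind_error(const std::string& column, const std::string& type) {
    return "Exception while binding column " + column +
           ": marshaling error for type " + type;
}
```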
"
The messaging service is (as many other services) present in
the global namespace and is widely accessed from where needed
with global get(_local)?_messaging_service() calls. There's a
long-term task to get rid of this globality and make services
and componenets reference each-other and, for and due-to this,
start and stop in specific order. This set makes this for the
messaging service.
The service is very low level and doesn't depend on anything.
It's used by gossiper, streaming, repair, migration manager,
storage proxy, storage service and API. According to this
dependencies the set consists of several parts:
patches 1-9 are preparatory, they encapsulate messaging service
init/fini stuff in its own module and decouple it from the
db::config
patch 10-12 introduce local service reference in main and set
its init/fini calls at the early stage so that this reference
can later be passed to those depending on it
patches 13-42 replace global referencing of messaging service
from other subsystems with local references initialized from
main.
patch 43 finalizes tests.
patch 44 wraps things up by removing the global messaging service
instance along with get(_local)?_messaging_service calls.
The service's stopping part is deliberately left incomplete (as
it is now), the sharded service remains alive, only the instance's
stop() method is called (and is empty for a while). Since the
messaging service's users still do not stop cleanly, its instances
should better continue leaking on exit.
Once (and if) seastar gets the helper rpc::has_handlers() method
merged the messaging_service::stop() will be able to check if all
the verbs had been unregistered (spoiler: not yet, more fixes to
come).
For debugging purposes the pointer on now-local messaging service
instance is kept in service::debug namespace.
tests: unit(dev)
dtest(dev: simple_boot_shutdown, repair, update_cluster_layout)
manual start-stop
"
* 'br-unglobal-messaging-service-2' of https://github.com/xemul/scylla: (44 commits)
messaging_service: Unglobal messaging service instance
tests: Use own instances of messaging_service
storage_service: Use local messaging reference
storage_service: Keep reference on sharded messaging service
migration_manager: Add messaging service as argument to get_schema_definition
migration_manager: Use local messaging reference in simple cases
migration_manager: Keep reference on messaging
migration_manager: Make push_schema_mutation private non-static method
migration_manager: Move get_schema_version verb handling from proxy
repair: Stop using global messaging_service references
repair: Keep sharded messaging service reference on repair_meta
repair: Keep sharded messaging service reference on repair_info
repair: Keep reference on messaging in row-level code
repair: Keep sharded messaging service in API
repair: Unset API endpoints on stop
repair: Setup API endpoints in separate helper
repair: Push the sharded<messaging_service> reference down to sync_data_using_repair
repair: Use existing sharded db reference
repair: Mark repair.cc local functions as static
streaming: Keep messaging service on send_info
...
Remove the global messaging_service, keep it on the main stack.
But also store a pointer on it in debug namespace for debugging.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The global one is going away, no core code uses it, so all tests
can be safely switched to use their own instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the affected places are (or had become, with previous patches)
storage service methods using the global messaging service, so they
can instead access the local reference to the messaging service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 4 places that call this helper:
- storage proxy. Callers are rpc verb handlers and already have the proxy
at hands from which they can get the messaging service instance
- repair. There's local-global messaging instance at hands, and the caller
is in verb handler too
- streaming. The caller is verb handler, which is unregistered on stop, so
the messaging service instance can be captured
- migration manager itself. The caller already uses "this", so the messaging
service instance can be obtained from it
A better approach would be to make get_schema_definition a method of
migration_manager, but the manager is stopped for real on shutdown, so
referencing it from the callers might not be safe and needs revisiting. At
the same time, the messaging service is always alive, so using its reference
is safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of those places are non-static migration_manager methods,
plus one place where the local service instance is already at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The local migration manager instance is already available at caller, so
we can call a method on it. This is to facilitate next patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user of this verb is migration manager, so the handler must be it as well.
The handler code now explicitly gets the global proxy. This call is safe, as the proxy
is not stopped nowadays. In the future we'll need to revisit the relation
between migration - proxy - stats anyway.
The use of local migration manager is safe, as it happens in verb handler which
is unregistered and is waited to be completed on migration manager stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all the users of messaging service have the needed reference.
Again, the messaging service is not really stopped at the end, so its usage
is safe regardless of whether repair stuff itself leaks on stop or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference comes from repair_info and storage_service calls, both
had been already patched for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row-level repair keeps its statics for needed services, same as the
streaming does. Treat the messaging service the same way to stop using
the global one in the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This adds the unset roll-back of the corresponding _set-s. The messaging
service will be (already is, but implicitly) used in repair API
callbacks, so make sure they are unset before the messaging service
is stopped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There will be the unset part soon, this is the preparation. No functional
changes in api/storage_server.cc, just move the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This function needs the messaging service inside, but the closest place where it
can get one from is the storage_service API handlers. Temporarily move the call for
global messaging service into storage service, its turn for this cleanup will
come later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The db.invoke_on_all's lambda tries to get the sharded db reference via
the global storage service. This can be done in a much nicer way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Streaming uses messaging, init it with its own reference.
Nowadays the whole streaming subsystem uses global static references on the
needed services. This is not nice, but still better than just using code-wide
globals, so treat the messaging service here the same way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is alive there, so it is safe to pass it to the lambda.
Once in forward_fn, the messaging service can be obtained from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the places that need messaging service in proxy already use
storage_proxy instance, so it is safe to get the local messaging
from it too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy is another user of messaging, so keep the reference on it. Its
real usage will come in next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiper needs messaging service, the messaging is started before the
gossiper, so we can push the former reference into it.
Neither the gossiper nor the messaging service is stopped for real, so
the memory usage is still safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Let's reserve space for sharding metadata in advance, to avoid excessive
allocations in create_sharding_metadata().
With the default ignore_msb_bits=12, it was observed that the # of
reallocations is frequently 11-12. With ignore_msb_bits=16, the number
can easily go up to 50.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200814210250.39361-1-raphaelsc@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6978
by Kamil Braun:
These abstractions are used for walking over mutations created by a write
coordinator, deconstructing them into "atomic" pieces ("changes"),
and consuming these pieces.
Read the big comment in cdc/change_visitor.hh for more details.
4 big functions were rewritten to use the new abstractions.
tests:
unit (dev build)
all dtests from cdc_tests.py, except cdc_tests.py:TestCdc.cluster_reduction_with_cdc_test, have passed (on a dev build). The test that fails also fails on master.
Part of #5945.
cdc: rewrite process_changes using inspect_mutation
cdc: move some functions out of `cdc::transformer`
cdc: rewrite extract_changes using inspect_mutation
cdc: rewrite should_split using inspect_mutation
cdc: rewrite find_timestamp using inspect_mutation
cdc: introduce a "change visitor" abstraction
"
After the data segregation feature, anything that causes out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
feature. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes #6928.
"
* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
compaction/twcs: improve further debug messages
compaction/twcs: Improve debug log which shows all windows
test: Check that TWCS properly performs size-tiered compaction on past windows
compaction/twcs: Make task estimation take into account the size-tiered behavior
compaction/stcs: Export static function that estimates pending tasks
compaction/stcs: Make get_buckets() static
compact/twcs: Perform size-tiered compaction on past time windows
compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
compaction/twcs: Make newest_bucket() non-static
compaction/twcs: Move TWCS implementation into source file
Contains patch from Rafael to fix up includes.
* seastar c872c3408c...7f7cf0f232 (9):
> future: Consider result_unavailable invalid in future_state_base::ignore()
> future: Consider result_unavailable invalid in future_state_base::valid()
> Merge "future-util: split header" from Benny
> docs: corrected some text and code-examples in streaming-rpc docs
> future: Reduce nesting in future::then
> demos: coroutines: include std-compat.hh
> sstring: mark str() and methods using it as noexcept
> tls: Add an assert
> future: fix coroutine compilation
Some tests directly reference the global messaging service. For the sake
of simpler patching wrap this global reference with a local one. Once the
global messaging service goes away tests will get their own instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
API is one of the subsystems that work with messaging service. To keep
the dependencies correct the related API stuff should be stopped before
the messaging service stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are a few places that initialize db and system_ks and need the
messaging service. Pass the reference to it from main instead of
using the global helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the preparation for moving the message service to main -- keep
a reference and eventually pass one to subsystems depending on messaging.
Once they are ready, the reference will be turned into an instance.
For now only push the reference into the messaging service init/exit
itself, other subsystems will be patched next.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Introduce back the .stop() method that will be used to really stop
the service. For now do not do sharded::stop, as its users are not
yet stopping, so this prevents use-after-free on messaging service.
For now the .stop() is empty, but will be in charge of checking if
all the other users had unregistered their handlers from rpc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The messaging service is a low-level one which doesn't need other
services, so it can be started first. Nowadays it's indeed started
before most of its users but one -- the gossiper.
In current code gossiper doesn't do anything with messaging service
until it starts, but very soon this dependency will be expressed in
terms of a reference from gossiper to messaging_service, thus by the
time the latter starts, the former should already exist.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now init_messaging_service() only deals with the messaging service
and related internal stuff, so it can sit in its own module.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This struct is nowadays only used to transport arguments from db::config
to messaging_service::scheduling_config; things get simpler if we drop it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This makes the messaging service configuration completely independent
from the db config. Next step would be to move the messaging service
init code into its module.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the continuation of the previous patch -- change the primary
constructor to work with config. This, in turn, will decouple the
messaging service from database::config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service constructor uses and copies many simple values; it is
much simpler to group them in a config. It also helps the next patches to
simplify the messaging service initialization and to keep the defaults
(for testing) in one place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The init_ms_fd_gossiper function initializes two services, but
effectively consists of two independent parts, so declare them
as such.
The duplication of listen address resolution will go away soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On today's stop() the messaging service is not really stopped as other
services still (may) use it and have handlers registered in it. Inside
.stop() only the rpc servers are brought down, so a better name
for this method would be shutdown().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just a cleanup. These internal stoppers must be private, also there
are too many public specifiers in the class description around them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current log prints one entry per window, doesn't print
the number of SSTs in the bucket, and the 'now' information is copied across
all the window entries.
previously, it looked like this:
[shard 0] compaction - Key 1597331160000000, now 1597331160000000
[shard 0] compaction - Key 1597331100000000, now 1597331160000000
[shard 0] compaction - Key 1597331040000000, now 1597331160000000
[shard 0] compaction - Key 1597330980000000, now 1597331160000000
this made it harder to group all windows which reflect the state of
the strategy in a given time.
now, it looks as follows:
[shard 0] compaction - time_window_compaction_strategy::newest_bucket:
now 1597331160000000
buckets = {
key=1597331160000000, size=1
key=1597331100000000, size=2
key=1597331040000000, size=1
key=1597330980000000, size=1
}
Also, the level of this log is changed from debug to trace, given that
it is now compressed and only printed once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The task estimation was not taking into account that TWCS does size-tiered
compaction on the windows; it only added 1 to the estimation when there
could be more tasks than that, depending on the number of SSTables in
all the existing size tiers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be useful for allowing other compaction strategies that use
STCS to properly estimate the pending tasks.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
STCS will export a static function to estimate pending tasks, and
it relies on get_buckets() being static too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
One of the most common pieces of code I write when investigating coredumps is
finding all objects of a certain type and creating a human readable
report of certain properties of these objects. This usually involves
retrieving all objects with a vptr with `find_vptrs()` and matching
their type to some pattern. I found myself writing this boilerplate over
and over again, so in this patch I introduce a convenience method to
avoid repeating it in the future.
Message-Id: <20200818145247.2358116-1-bdenes@scylladb.com>
scylla_fiber._walk() has an early return condition on the passed-in
pointer actually being a task pointer. The problem is that the actual
return statement was missing and only an error was logged. This resulted
in execution continuing and further weird errors being printed due to
the code not knowing how to handle the bad pointer.
Message-Id: <20200818144902.2357289-1-bdenes@scylladb.com>
thread_wake_task was moved into an anonymous namespace, add this new
fully qualified name to the task name white-list. Leave the old name for
backward compatibility.
While at it, also add `seastar::thread_context` which is also a task
object, for better seastar thread support.
Message-Id: <20200818142206.2354921-1-bdenes@scylladb.com>
Accessing an element of a tuple from the gdb command line is a
nightmare; add support to collection_element() for retrieving one of its
elements to make this easier.
Message-Id: <20200818141123.2351892-1-bdenes@scylladb.com>
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.
Fixes: #7060
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
"
operator_type is awkward because it's not copyable or assignable. Replace it with a new enum class.
Tests: unit(dev)
"
* dekimir-operator-type:
cql3: Drop operator_type entirely
cql3: Drop operator_type from the parser
cql3/expr: Replace operator_type with an enum
Replace operator_type with the nicer-behaved oper_t in CQL parser and,
consequently, in the relation hierarchy and column_condition.
After this, no references to operator_type remain in live code.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
operator_type is awkward because it's not copyable or assignable.
Replace it in expression representation with a new enum class, oper_t.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
As suggested by Avi, let's move the tarballs from
"build/dist/<mode>/tar" to "build/<mode>/dist/tar" to retain the
symmetry of different build modes, and make the tarballs easier to
discover. While at it, let's document the new tarball locations.
Message-Id: <20200818100427.1876968-1-penberg@scylladb.com>
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested as sstables that were added concurrently with the processing of
a batch for the same sstables being dropped, with the semaphore units
associated with them never returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.
This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.
Fixes: #6892
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
Except for scylla-python3, each scylla package has its own git repository, the same package script filename, and the same build directory structure.
To fit the python3 packaging into the scylla repo, we created 'python3' directories in multiple locations, made '-python3'-suffixed files, and dug a deeper build directory so as not to conflict with the scylla-server package build.
We should move all scylla-python3 related files to a new repository, scylla-python3.
To keep compatibility with current Jenkins script, provide packages on
build/ directory for now.
Fixes #6751
After the data segregation feature, anything that causes out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
feature. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes #6928.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
TWCS is hard to extend because its knowledge of what to do with a window
bucket is duplicated in two functions. Let's remove this duplication by
placing the knowledge into a single function.
This is important for the coming change that will perform size-tiered
instead of major on windows that are no longer active.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously, system.paxos TTL was set as max(3h, gc_grace_seconds).
Introduce a new per-table option named `paxos_grace_seconds` to set
the number of seconds used to TTL data in paxos tables
when using LWT queries against the base table.
Default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.
This change makes it easy to test various issues related to paxos TTL.
Fixes #6284
Tests: unit (dev, debug)
Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
This is an abstraction for walking over mutations created by a write
coordinator, deconstructing them into "atomic" pieces ("changes"),
and consuming these pieces.
Read the big comment in cdc/change_visitor.hh for more details.
While Alternator doesn't yet support creating a table with different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.
This patch also adds a test for this case, which failed on Alternator
before this patch.
Fixes #7031.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
This patch adds a test that attributes which serve as a key for a
secondary index still appear in the OldImage in an Alternator Stream.
This is a special case, because although usually Alternator attributes
are saved as map elements, not stand-alone Scylla columns, in the special
case of secondary-index keys they *are* saved as actual Scylla columns
in the base table. And it turns out we produce wrong results in this case:
CDC's "preimage" does not currently include these columns if they didn't
change, while DynamoDB requires that all columns, not just the changed ones,
appear in OldImage. So the test added in this patch xfails on Alternator
(and as usual, passes on DynamoDB).
Refs #7030.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812144656.148315-1-nyh@scylladb.com>
The "user.home" system property in JVM does not use the "HOME"
environment variable. This breaks Ant and Maven builds with Podman,
which attempts to look up the local Maven repository in "/root/.m2" when
building tools, for example:
build.xml:757: /root/.m2/repository does not exist.
To fix the issue, let's bind-mount an /etc/passwd file, which contains
host username for UID 0, which ensures that Podman container $USER and $HOME
are the same as on the host.
Message-Id: <20200817085720.1756807-1-penberg@scylladb.com>
"
gossip: Get rid of seed concept
The concept of seed and the different behaviour between seed nodes and
non-seed nodes generate a lot of confusion, complication and errors for
users. For example: how to add a seed node into a cluster, how to
promote a non-seed node to a seed node, how to choose seed nodes in a
multi-DC setup, how to edit config files for seeds, why a seed node does not
bootstrap.
If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.
Major changes:
Seed nodes are only used as the initial contact point nodes.
Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
The unsafe auto_bootstrap option is now ignored.
Gossip shadow round now talks to all nodes instead of just seed nodes.
Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"
* 'gossip_no_seed_v2' of github.com:asias/scylla:
gossip: Get rid of seed concept
gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
gossip: Add do_apply_state_locally helper
gossip: Do not talk to seed node explicitly
gossip: Talk to live endpoints in a shuffled fashion
The concept of seed and the different behaviour between seed nodes and
non-seed nodes generate a lot of confusion, complication and errors for
users. For example: how to add a seed node into a cluster, how to
promote a non-seed node to a seed node, how to choose seed nodes in a
multi-DC setup, how to edit config files for seeds, why a seed node does not
bootstrap.
If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.
Major changes:
- Seed nodes are only used as the initial contact point nodes.
- Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
- The unsafe auto_bootstrap option is now ignored.
- Gossip shadow round now attempts to talk to all nodes instead of just seed nodes.
Manual test:
- bootstrap n1, n2, n3 (n1 and n2 are listed as seed, check only n1
will skip bootstrap, n2 and n3 will bootstrap)
- shutdown n1, n2, n3
- start n2 (check non seed node can boot)
- start n1 (check n1 talks to both n2 and n3)
- start n3 (check n3 talks to both n1 and n2)
Upgrade/Downgrade test:
- Initialize cluster
Start 3 node with n1, n2, n3 using old version
n1 and n2 are listed as seed
- Test upgrade starting from seed nodes
Rolling restart n1 using new version
Rolling restart n2 using new version
Rolling restart n3 using new version
- Test downgrade to old version
Rolling restart n1 using old version
Rolling restart n2 using old version
Rolling restart n3 using old version
- Test upgrade starting from non seed nodes
Rolling restart n3 using new version
Rolling restart n2 using new version
Rolling restart n1 using new version
Notes on upgrade procedure:
There is no special procedure needed to upgrade to Scylla without seed
concept. Rolling upgrade node one by one is good enough.
Fixes: #6845
Tests: ./test.py + update_cluster_layout_tests.py + manual test
Since older binutils on some distributions is not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, the debian packager forces debuginfo compression since debian/compat = 9,
so we have to uncompress the files after they are automatically compressed.
Fixes #6982
"
This patch series changes the build system to build all tarballs to
build/dist/<mode>/tar directory. For example, running:
./tools/toolchain/dbuild ./configure.py --mode=dev && ./tools/toolchain/dbuild ninja-build dist-tar
produces the following tarballs in build/dist/dev/tar:
$ ls -1 build/dist/dev/tar/
scylla-jmx-package.tar.gz
scylla-package.tar.gz
scylla-python3-package.tar.gz
scylla-tools-package.tar.gz
This makes it easy to locate release tarballs for humans and scripts. To
preserve backward compatibility, the tarballs are also retained in their
original locations. Once release engineering infrastructure has been
adjusted to use the new locations, we can drop the duplicate copies.
"
* 'penberg/build-dist-tar/v1' of github.com:penberg/scylla:
configure.py: Copy tarballs to build/dist/<mode>/tar directory
configure.py: Add "dist-<component>-tar" targets
reloc/python3: Add "--builddir" to build_deb.sh
configure.py: Use copy-on-write copies when possible
"
Contains several fixes which improve debuggability in situations where
too large column ids are passed to column definition loop methods.
"
* 'schema-range-check-fix' of github.com:tgrabiec/scylla:
schema: Add table name and schema version to error messages
schema: Use on_internal_error() for range check errors
schema: Fix off-by-one in column range check
schema: Make range checks for regular and static columns the same as for clustering columns
select() is too generic a name for the method that retrieves sstable runs,
and it has a completely different meaning from the former select
method, which was used to select sstables based on token range.
Let's give it a more descriptive name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811193401.22749-1-raphaelsc@scylladb.com>
regression caused by 55cf219c97.
remove_by_toc_name() must work both for a sealed sstable with toc,
and also a partial sstable with tmp toc.
so dirname() should be called conditionally, depending on the state of
the sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200813160612.101117-1-raphaelsc@scylladb.com>
LCS can have its overlapping invariant broken after operations that can
proceed in parallel to regular compaction like cleanup. That's because
there could be two compactions in parallel placing data in overlapping
token ranges of a given level > 0.
After reshape, on restart, the whole table will be rewritten if a
given level has more than (fan_out*2)=20 overlaps.
That may sound like enough, but that's not taking into account the
exponential growth in # of SSTables per level, so 20 overlaps may
sound like a lot for level 2 which can afford 100 sstables, but it's
only 2% of level 3, and 0.2% of level 4. So let's change the
overlapping tolerance from the constant of fan_out*2 to 10% of level
limit on # of SSTables, or fan_out, whichever is higher.
Refs #6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200810154510.32794-1-raphaelsc@scylladb.com>
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.
When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.
Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.
Fixes #6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
Before this patch, modifying cdc/cdc_options.hh required recompiling 264
source files. This is because this header file was included by a couple
other header files - most notably schema.hh, where a forward declaration
would have been enough. Only the handful of source files which really
need to access the CDC options should include "cdc/cdc_options.hh" directly.
After this patch, modifying cdc/cdc_options.hh requires only 6 source files
to be recompiled.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200813070631.180192-1-nyh@scylladb.com>
C++17 introduced try_emplace for maps to replace the pattern:
if (map.find(key) == map.end()) {
    map.emplace(key, value);
}
try_emplace is more efficient and results in more concise code.
This commit introduces usage of try_emplace when it's appropriate.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4970091ed770e233884633bf6d46111369e7d2dd.1597327358.git.piotr@scylladb.com>
This testcase was temporarily commented out in 37ebe52, because it
relied on buggy (#6369) behaviour fixed by that commit. Specifically,
it expected a NULL comparison to match a NULL cell value. We now
bring it back, with corrected result expectation.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The _clients member is an std::vector, so it doesn't have _M_elems.
Luckily there's the std_vector() class for it.
Also, seastar::rpc::server::_conns is an unordered_map, not an
unordered_set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200814070858.32383-1-xemul@scylladb.com>
It is needed to print the boost::intrusive::list which is used
by repair_meta_for_masters in repair.
Fixes #7037
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the `count` function was often used in various ways.
`contains` not only expresses the intent of the code better but also
does so in a more uniform way.
This commit replaces all occurrences of `count` with
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
1. The node1 is shutdown
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
in apply_state_locally and calls gossiper::handle_major_state_change
and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive;
node2 will skip marking node1 as alive because the status of node1 is
SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.
To fix, we serialize the two operations.
Fixes #7032
Merged pull request https://github.com/scylladb/scylla/pull/7028
By Calle Wilund:
Changes the "preimage" option from binary true/false to on/off/full (accepting true/false, and using old style notation for normal to string - for upgrade reasons), where "full" will force us to include all columns in pre image log rows.
Adds small test (just adding the case to preimage test).
Uses the feature in alternator
Fixes #7030
alternator: Set "preimage" to "full" for streams
cdc_test: Do small test of "full"
cdc: Make pre image optionally "full" (include all columns)
Fixes #7030
Dynamo/alternator streams old image data is supposed to
contain the full old value blob (all keys/values).
Setting preimage=full ensures we get even those properties
that have separate columns if they are not part of an actual
modification.
Makes the "preimage" option for cdc non-binary, i.e. it can now
be "true"/"on", "false"/"off" or "full. The two former behaving like
previously, the latter obviously including all columns in pre image.
Merged pull request https://github.com/scylladb/scylla/pull/7018
by Piotr Sarna:
This series addresses various issues with metrics and semaphores - it mainly adds missing metrics, which makes it possible to see the length of the queues attached to the semaphores. In the case of view building and view update generation, metrics were not present in these services at all, so a first, basic implementation is added.
More precise semaphore metrics would ease the testing and development of load shedding and admission control.
view_builder: add metrics
db, view: add view update generator metrics
hints: track resource_manager sending queue length
hints: add drain queue length to metrics
table: add metrics for sstable deletion semaphore
database: remove unused semaphore
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.
Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
"
This series adds progress metrics for the node operations. Metrics for bootstrap and rebuild progress are added as a starter. I will add more for the remaining operations after getting feedback.
With this the Scylla Monitor and Scylla Manager can know the progress of the bootstrap and other node operations. E.g.,
scylla_node_ops_bootstrap_nr_ranges_finished{shard="0",type="derive"} 50
scylla_node_ops_bootstrap_nr_ranges_total{shard="0",type="derive"} 1040
Fixes #1244, #6733
"
* 'repair_progress_metrics_v3' of github.com:asias/scylla:
repair: Add progress metrics for repair ops
repair: Add progress metrics for rebuild ops
repair: Add progress metrics for bootstrap ops
"
It is pretty hard to find the repair_meta object when debugging a core.
This patch makes it easier by putting the repair_meta objects created by
both the repair follower and the master into a map.
Fixes #7009
"
* asias-repair_make_debug_eaiser_track_all_repair_metas:
repair: Add repair_meta_tracker to track repair_meta for followers and masters
repair: Move thread local object _repair_metas out of the function
* 'espindola/move-out-of-line' of https://github.com/espindola/scylla:
test: Move code in sstable_run_based_compaction_strategy_for_tests.hh out of line
test: Drop ifdef now that we always use c++20
test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib
It is pretty hard to find the repair_meta object when debugging a core.
This patch makes it easier by putting the repair_meta objects created by
both the repair follower and the master into a boost intrusive list.
Fixes #7009
The test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.
Use the table's row_cache stats instead.
Fixes #6773
Test: cql_query_test.test_cache_bypass(dev, debug)
Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
Make all required allocations in advance of merging sstables
into _compacting_sstables, so the merge does not throw
after registering some, but not all, of the sstables.
Test: database_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811132440.416945-1-bhalevy@scylladb.com>
Currently when running against a debug build, our integration test suite
suffers from a ton of timeout related error logs, caused by auth queries
timing out. This causes spurious test failures due to the unexpected
error messages in the log.
This patch increases the timeout for internal distributed auth queries
in debug mode, to give the slow debug builds more headroom to meet the
timeout.
Refs: #6548
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200811145757.1593350-1-bdenes@scylladb.com>
"
Let users disable the unencrypted native transport too by setting the port to
zero in the scylla.yaml configuration file.
Fixes #6997
"
* penberg-penberg/native-transport-disable:
docs/protocol: Document CQL protocol port configuration options
transport: Allow user to disable unencrypted native transport
"
Refs #6148
Separates disk usage into two cases: Allocated and used.
Since we use both reserve and recycled segments, both of
which are not actually filled with anything at the point
of waiting.
Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.
And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, when we only have
a half segment left before hitting used == max.
Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.
Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.
"
* elcallio-calle/commitlog_size:
commitlog: Make commitlog respect disk limit better
commitlog: Demote buffer write log messages to trace
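The "half a segment left" heuristic described above can be sketched as a simple threshold check (the names and exact rule here are illustrative assumptions, not the actual commitlog code):

```cpp
#include <cassert>
#include <cstdint>

// Suggest flushing once less than half a segment of budget remains,
// i.e. when the current usage plus half a segment would reach the
// configured disk limit.
bool should_suggest_flush(uint64_t used, uint64_t segment_size, uint64_t max_disk) {
    return used + segment_size / 2 >= max_disk;
}
```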
The service constructor included a check ensuring that only
standard_role_manager can be used with password_authenticator. But
after 00f7bc6, password_authenticator does not depend on any action of
standard_role_manager. All queries to meta::roles_table in
password_authenticator seem self-contained: the table is created at
the start if missing, and salted_hash is CRUDed independently of any
other columns bar the primary key role_col_name.
NOTE: a nonstandard role manager may not delete a role's row in
meta::roles_table when that role is dropped. This will result in
successful authentication for that non-existing role. But the clients
call check_user_can_login() after such authentication, which in turn
calls role_manager::exists(role). Any correctly implemented role
manager will then return false, and authentication_exception will be
thrown. Therefore, no dependencies exist on the role-manager
behaviour, other than it being self-consistent.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
This patch set fixes stalls in repair that are caused by std::list merge and clear operations during the test_latency_read_with_nemesis test.
Fixes #6940 Fixes #6975 Fixes #6976
"
* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
repair: Use clear_gently in get_sync_boundary to avoid stall
utils: Add clear_gently
repair: Use merge_to_gently to merge two lists
utils: Add merge_to_gently
The row_diff list in apply_rows_on_master_in_thread and
apply_rows_on_follower can be large. Modify do_apply_rows to remove the
row from the list as it is consumed, to avoid a stall when the list
is destroyed.
Fixes #6975
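The idea (destroy list nodes incrementally as rows are consumed, so destroying the list at the end has nothing left to free in one long loop) can be sketched as follows; `apply_row` and the element type are placeholders, and the real code additionally yields between elements:

```cpp
#include <cassert>
#include <list>

// Consume each element and erase it immediately, so the list shrinks as
// we go. Destroying the already-empty list afterwards is O(1) instead of
// freeing every node in a single stall-inducing pass.
template <typename T, typename Consumer>
void consume_and_clear(std::list<T>& rows, Consumer apply_row) {
    while (!rows.empty()) {
        apply_row(rows.front());
        rows.pop_front();
    }
}
```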
"
Make do_io_check and the io_check functions that
call it noexcept. Up to sstable_write_io_check
and sstable_touch_directory_io_check.
Tests: unit (dev)
"
* tag 'io-check-noexcept-v1' of github.com:bhalevy/scylla:
sstable: io_check functions: make noexcept
utils: do_io_check: adjust indentation
utils: io_check: make noexcept for future-returning functions
Refs #6148
Separates disk usage into two cases: Allocated and used.
Since we use both reserve and recycled segments, both of
which are not actually filled with anything at the point
of waiting.
Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.
And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, when we only have
a half segment left before hitting used == max.
Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.
Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.
v2:
* Add some comments/explanations
v3:
* Made disk footprint subtract happen post delete (non-optimistic)
"
This series adds support for the "md" sstable format.
Support is based on the following:
* do not use clustering based filtering in the presence
of static rows or tombstones.
* Disabling min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does
and until we process them accurately).
* Fix the way we maintain min/max column names by:
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes #4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intent of the code
more clearly.
This commit replaces all occurrences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
"
Make sure to close sstable files also on error paths.
Refs #5509 Fixes #6448
Tests: unit (dev)
"
* tag 'sstable-close-files-on-error-v6' of github.com:bhalevy/scylla:
sstable: file_writer: auto-close in destructor
sstable: file_writer: add optional filename member
sstable: add make_component_file_writer
sstable: remove_by_toc_name: accept std::string_view
sstable: remove_by_toc_name: always close file and input stream
sstable: delete_sstables: delete outdated FIXME comment
sstable: remove_by_toc_name: drop error_handler parameter
sstable: remove_by_toc_name: make static
sstable: read_toc: always close file
sstable: mark read_toc and methods calling it noexcept
sstable: read_toc: get rid of file_path
sstable: open_data, create_data: set member only on success.
sstable: open_file: mark as noexcept
sstable: new_sstable_component_file: make noexcept
sstable: new_sstable_component_file: close file on failure
sstable: rename_new_sstable_component_file: do not pass file
sstable: open_sstable_component_file_non_checked: mark as noexcept
sstable: open_integrity_checked_file_dma: make noexcept
sstable: open_integrity_checked_file_dma: close file on failure
The following metric is added:
scylla_node_maintenance_operations_repair_finished_percentage{shard="0",type="gauge"} 0.650000
It is the finished percentage across all ongoing repair operations.
When all ongoing repair operations finish, the percentage stays at 100%.
Fixes #1244, #6733
Although python2 should be a distant memory by now, the reality is that
we still need to debug scylla on platforms that still have no python3
available (centos7), so we need to keep scylla-gdb.py python2
compatible.
Refs: #7014
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811093753.1567689-1-bdenes@scylladb.com>
maintainers.md contains a very helpful explanation of how to backport
Seastar fixes to old branches of Scylla, but has a tiny typo, which
this patch corrects.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200811095350.77146-1-nyh@scylladb.com>
Fixes #6948
Changes the stream_id format from
<token:64>:<rand:64>
to
<token:64>:<rand:38><index:22><version:4>
The code will attempt to assert version match when
presented with a stored id (i.e. construct from bytes).
This means that ID:s created by previous (experimental)
versions will break.
Moves the ID encoding fully into the ID class, and makes
the code path private for the topology generation code
path.
Removes some superfluous accessors but adds accessors for
token, version and index. (For alternator etc).
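The new layout packs three fields into the second 64-bit word of the stream id; a sketch of such packing follows (the bit order chosen here, rand in the high bits and version in the low bits, is an assumption for illustration, not necessarily Scylla's actual encoding):

```cpp
#include <cassert>
#include <cstdint>

// Pack <rand:38><index:22><version:4> into one 64-bit word (the first
// 64 bits of the stream id hold the token and are not modeled here).
uint64_t pack_stream_id(uint64_t rand38, uint32_t index22, uint8_t version4) {
    return ((rand38 & ((1ull << 38) - 1)) << 26)
         | ((uint64_t(index22) & ((1ull << 22) - 1)) << 4)
         | (version4 & 0xf);
}

// Accessors mirroring the ones the commit adds for version and index.
uint8_t stream_id_version(uint64_t word) { return word & 0xf; }
uint32_t stream_id_index(uint64_t word) { return (word >> 4) & ((1u << 22) - 1); }
```

Keeping the version in a fixed field is what lets the code assert a version match when reconstructing an id from stored bytes.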
Add unit tests reproducing https://github.com/scylladb/scylla/issues/3552
with clustering-key filtering enabled.
enable_sstables_md_format option is set to true as clustering-key
filtering is enabled only for md-format sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that it is safe to filter md format sstable by min/max column names
we can remove the `filtering_broken` variable that disabled filtering
in 19b76bf75b to fix #4442.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent https://github.com/scylladb/scylla/issues/3552
we want to ensure that in any case that the partition exists in any
sstable, we emit partition_start/end, even when returning no rows.
In the first filtering pass, filter_sstable_for_reader_by_pk filters
the input sstables based on the partition key, and num_sstables is set to the size
of the sstables list after the first filtering pass.
An empty sstables list at this stage means there are indeed no sstables
with the required partition so returning an empty result will leave the
cache in the desired state.
Otherwise, we filter again, using filter_sstable_for_reader_by_ck,
and examine the list of the remaining readers.
If num_readers != num_sstables, we know that
some sstables were filtered by clustering key, so
we append a flat_mutation_reader_from_mutations to
the list of readers and return a combined reader as before.
This will ensure that we will always have a partition_start/end
mutations for the queried partition, even if the filtered
readers emit no rows.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the md sstable format, min/max column names in the metadata now
track clustering rows (with or without row tombstones),
range tombstones, and partition tombstones (that are
reflected with empty min/max column names - indicating
the full range).
As such, min and max column names may be of different lengths
due to range tombstones and potentially short clustering key
prefixes with compact storage, so the current matching algorithm
must be changed to take this into account.
To determine if a slice range overlaps the min/max range
we are using position_range::overlaps.
sstable::clustering_components_ranges was renamed to position_range
as it now holds a single position_range rather than a vector of bytes_view ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
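At its core, that overlap test is a closed-interval intersection check; a simplified sketch over integers (the real position_range::overlaps compares clustering positions, not ints):

```cpp
#include <cassert>

// A slice range [slice_lo, slice_hi] overlaps the sstable's
// [min_key, max_key] clustering range iff neither interval ends before
// the other begins. Both bounds are treated as inclusive, matching
// min/max column names semantics.
bool ranges_overlap(int slice_lo, int slice_hi, int min_key, int max_key) {
    return slice_lo <= max_key && min_key <= slice_hi;
}
```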
Move contains_rows from table code to sstable::may_contain_rows
since its implementation now depends on overly specific knowledge of sstable
internals.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Static rows aren't reflected in the sstable min/max clustering keys metadata.
Since we don't have any indication in the metadata that the sstable stores
static rows, we must read all sstables if a static column is requested.
Refs #3553
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We're about to drop `filtering_broken` in a future patch
when clustering filtering can be supported for md-format sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
MD format is disabled by default at this point.
The option extends enable_sstables_mc_format,
so both need to be set to support
the md format.
The MD_FORMAT cluster feature will be added in
a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is required for test applications that may select a sstable
format different than the default mc format, like perf_fast_forward.
These apps don't use the gossip-based sstables_format_selector
to set the format based on the cluster feature and so they
need to rely on the db config.
Call set_format_by_config in single_node_cql_env::do_with.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Run the test cases that write sstables using both the
mc and md versions. Note that we can still compare the
resulting Data, Index, Digest, and Filter components
with the prepared mc sstables we have since these
haven't changed in md.
We take special consideration around validating
min/max column names that are now calculated using
a revised algorithm in the md format.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test also the md format in all_sstable_versions.
Add pre-computed md-sstable files generated using Cassandra version 3.11.7
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using a constant sstable_version_types::mc.
In preparation to supporting sstable_version_types::md.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Partition tombstones represent an implicit clustering range
that is unbound on both sides, so reflect that in the min/max
column names metadata using empty clustering key prefixes.
If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstone, for which
we can do a better filtering job, using the sstable min/max
column names metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.
By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.
Extend the sstable_tombstone_metadata_check unit test
to cover these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The following checks were introduced in 0a5af61176
to deal with a bug of our own in min/max metadata generation,
from a time when only ka / la were supported.
This is no longer relevant now that we'll consider min_max_column_names
only for sstable formats > mc (in sstable::may_contain_rows).
We choose not to clear_incorrect_min_max_column_names
from older versions here as this disturbs sstable unit tests.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The min/max column names metadata can be trusted only
starting with the md format, so just always return `true`
for older sstable formats.
Note that we could achieve that by clearing the min/max
metadata in set_clustering_components_ranges but we choose
not to do so since it disturbs sstable unit tests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the logic from table to sstable as it will contain
intimate knowledge of the sstable min/max column names validity
for md format.
Also, get rid of the sstable::clustering_components_ranges() method
as the member is used only internally by the sstable code now.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of keeping the `_has_min_max_clustering_keys` flag,
just store an empty key for `_{min,max}_clustering_key` to represent
the full range. These will never be narrowed down and will be
encoded as empty min/max column names as if they weren't set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.
For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.
This change keeps the min or max keys as a whole.
It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.
Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erronuously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
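The difference can be demonstrated with per-component min/max versus plain lexicographic comparison of whole keys (a simplified model using vectors of strings; the real algorithm additionally applies the range-tombstone-bound rules for short prefixes described above):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using ckey = std::vector<std::string>;

// Per-component tracking can invent keys that were never written:
// starting from {"a","2"}, adding {"b","1"} yields min {"a","1"},
// max {"b","2"}. Assumes min/max were seeded with the first key and
// all keys have the same number of components.
void track_componentwise(ckey& min, ckey& max, const ckey& k) {
    for (size_t i = 0; i < k.size(); ++i) {
        min[i] = std::min(min[i], k[i]);
        max[i] = std::max(max[i], k[i]);
    }
}

// Whole-key tracking keeps actual keys: min {"a","2"}, max {"b","1"}.
void track_whole(ckey& min, ckey& max, const ckey& k) {
    if (k < min) min = k;
    if (k > max) max = k;
}
```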
Pass the sstable schema to the metadata_collector constructor.
Note that the long term plan is to move metadata_collector
to the sstable writer but this requires a bigger change to
get rid of the dependencies on it in the legacy writer code
in class sstable methods.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:
<collection>.erase(
std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
<collection>.end());
In C++20 the same can be expressed with:
std::erase_if(<collection>, <predicate>);
This commit replaces all occurrences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
Fixes #6995
In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken by the fact that we no longer flush all sstables
unless auto snapshot is enabled.
This means the low_mark assertion does not hold, because we maybe/probably
never got around to creating the sstables that would hold said mark.
Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.
"
Not recommended setups will still run iotune.
Fixes #6631
"
* tarzanek-gcp-iosetup:
scylla_io_setup: Supported GCP VMs with NVMEs get out of box I/O configs
scylla_util.py: add support for gcp instances
scylla_util.py: support http headers in curl function
scylla_io_setup: refactor iotune run to a function
"
As I described at https://github.com/scylladb/scylla/issues/6973#issuecomment-669705374,
we need to avoid disk full on scylla_swap_setup, also we should provide manual configuration of swap size.
This PR provides following things:
- use psutil to get memtotal and disk free, since it provides better APIs
- calculate swap size in bytes to avoid errors on low-memory environments
- prevent using more than 50% of disk space for the auto-configured swap size; abort setup when almost no disk space is available (less than 2GB)
- add --swap-size option to specify swap size both on scylla_swap_setup and scylla_setup
- add interactive swap size prompt on scylla_setup
Fixes #6947
Related #6973
Related scylladb/scylla-machine-image#48
"
* syuu1228-scylla_swap_setup_improves:
scylla_setup: add swap size interactive prompt on swap setup
scylla_swap_setup: add --swap-size option to specify swap size
scylla_swap_setup: limit swapfile size to half of diskspace
scylla_swap_setup: calculate in bytes instead of GB
scylla_swap_setup: use psutil to get memtotal and disk free
Sometimes (e.g. when investigating a suspected heap use-after-free) it
is useful to include dead objects in the search results. This patch adds
a new option to scylla find to enable just that. Also make sure we
save and print the offset of the address in the object for dead
objects, just like we do for live ones.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200810091202.1401231-1-bdenes@scylladb.com>
We should not fill the entire disk with the swapfile; it's safer to limit
swap size to 50% of disk space.
Also, if 50% of disk space is <= 1GB, abort setup since the disk is effectively out of space.
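A rough model of that cap (illustrative only; the actual logic lives in the scylla_swap_setup Python script, and the names here are made up):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t GiB = 1ull << 30;

// Cap the requested swap size at half of the free disk space; give up
// (return 0, meaning abort setup) when that half is 1 GiB or less.
uint64_t swap_size_bytes(uint64_t requested, uint64_t disk_free) {
    uint64_t cap = disk_free / 2;
    if (cap <= GiB) {
        return 0; // effectively out of disk space
    }
    return requested < cap ? requested : cap;
}
```

Working in bytes throughout avoids the rounding problems that integer-gigabyte arithmetic caused on small instances.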
Converting memory & disk sizes to an int value of N gigabytes was too rough;
it became problematic in low-memory / low-disk environments, such as
some types of EC2 instances.
We should calculate in bytes.
To get better API of memory & disk statistics, switch to psutil.
With the library we don't need to parse /proc/meminfo.
[avi: regenerate tools/toolchain/image for new python3-psutils package]
It is used only for updating the metadata_collector {min,max}_column_names.
Implement metadata_collector::do_update_min_max_components in
sstables/metadata_collector.cc that will be used to host some other
metadata_collector methods in following patches that need not be
implemented in the header file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The following metric is added:
scylla_node_maintenance_operations_rebuild_finished_percentage{shard="0",type="gauge"} 0.650000
It is the finished percentage of the rebuild operation so far.
Fixes #1244, #6733
The following metric is added:
scylla_node_maintenance_operations_bootstrap_finished_percentage{shard="0",type="gauge"} 0.850000
It is the finished percentage of the bootstrap operation so far.
Fixes #1244, #6733
Otherwise we may trip the following
assert(_closing_state == state::closed);
in ~append_challenged_posix_file_impl
when the output_stream is destructed.
Example stack trace:
non-virtual thunk to seastar::append_challenged_posix_file_impl::~append_challenged_posix_file_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:944
seastar::shared_ptr_count_for<checked_file_impl>::~shared_ptr_count_for() at crtstuff.c:?
seastar::shared_ptr<seastar::file_impl>::~shared_ptr() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:944
(inlined by) seastar::file::~file() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/file.hh:155
(inlined by) seastar::file_data_sink_impl::~file_data_sink_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/fstream.cc:312
(inlined by) seastar::file_data_sink_impl::~file_data_sink_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/fstream.cc:312
seastar::output_stream<char>::~output_stream() at crtstuff.c:?
sstables::sstable::write_crc(sstables::checksum const&) [clone .cold] at sstables.cc:?
sstables::mc::writer::close_data_writer() at crtstuff.c:?
sstables::mc::writer::consume_end_of_stream() at crtstuff.c:?
sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#1}::operator()() at sstables.cc:?
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unify common code for file creation and file_writer construction
for sstable components.
It is defined as noexcept based on `new_sstable_component_file`
and makes sure the file is closed on error by using `file_writer::make`
that guarantees that.
Will be used for auto-closing the writer as a RAII object.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of seastar::async.
Use seastar::with_file to make sure the opened file is always closed
and move in.close() into a .finally continuation.
While at it, make remove_by_toc_name noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
* seastar eb452a22a0...e615054c75 (13):
> memory: fix small aligned free memory corruption
Fixes #6831
> logger: specify log methods as noexcept
> logger: mark trivial methods as noexcept
> everywhere: Replace boost::optional with std::optional
> httpd: fix Expect header case sensitivity
> rpc: Avoid excessive casting in has_handler() helper
> test: Increase capacity of fair-queue unit test case
> file_io_test: Simplify a test with memory::with_allocation_failures
> Merge 'Add HTTP/1.1 100 Continue' from Wojciech
Fixes #6844
> future: Use "if constexpr"
> future: Drop code that was avoiding a gcc 8 warning
> file: specify alignment get methods noexcept
> merge: Specify abort_source subscription handlers as noexcept
read_toc can be marked as noexcept now that new_sstable_component_file is.
With that, other methods that call it can be marked noexcept too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation for closing the file in all paths,
get rid of the file_path sstring and just recompute
it as needed on error paths using the this->filename method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that sstable::open_file is noexcept,
if any of open or create of data/index doesn't succeed,
we don't set the respective sstable member and return the failure.
When destroyed, the sstable destructor will close any file (data or index),
that we managed to open.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that new_sstable_component_file is noexcept,
open_file can be specified as noexcept too.
This makes error handling in create/open sstable
data and index files easier using when_all_succeed().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the function is handed over a `file` that it just passes through
on success. Let its single caller do that to simplify its error handling.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that open_integrity_checked_file_dma (and open_file_dma)
are noexcept, open_sstable_component_file_non_checked can be noexcept too.
Also, get a std::string_view name instead of a const sstring&
to match both open_integrity_checked_file_dma and open_file_dma
name arg.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Convert to accepting std::string_view name.
Move the sstring allocation to make_integrity_checked_file
that may already throw.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use seastar::with_file_close_on_failure to make sure the file
we open is closed on failure of the continuation,
as make_integrity_checked_file may throw from ::make_shared.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the classes representing CQL expressions (and utility functions
on them) from the `restrictions` namespace to a new namespace `expr`.
Most of the restriction.hh content was moved verbatim to
expression.hh. Similarly, all expression-related code was moved from
statement_restrictions.cc verbatim to expression.cc.
As suggested in #5763 feedback
https://github.com/scylladb/scylla/pull/5763#discussion_r443210498
Tests: dev (unit)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Iotune isn't run for supported/recommended GCP instances anymore;
we now set the I/O properties based on GCP tables or our
measurements (where applicable).
Setups that are not recommended/supported will still run iotune.
Fixes #6631
GCP VMs with NVMEs that are recommended can be recognized now.
Detection is done using resolving of internal GCP metadata server.
We recommend instances with at least 2 CPUs. For more than 16 disks
we mandate at least 32 CPUs. A 50:1 disk-to-RAM ratio also has to be
kept.
Instances that use NVMEs as root disks are also considered unsupported.
Supported instances for NVMEs are n1, n2, n2d, c2 and m1-megamem-96.
All others are unsupported now.
While answering a stackoverflow question on how to create an item but only
if we don't already have an item with the same key, I realized that we never
actually tested that ConditionExpressions works on key columns: all the
tests we had in test_condition_expression.py had conditions on non-key
attributes. So in this patch we add two tests with a condition on the key
attribute.
Most examples of conditions on the key attributes would be silly, but
in these two tests we demonstrate how a test on key attributes can be
useful to solve the above need of creating an item if no such item
exists yet. We demonstrate two ways to do this using a condition on
the key - using either the "<>" (not equal) operator, or the
"attribute_not_exists()" function.
These tests pass - we don't have a bug in this. But it's nice to have
a test that confirms that we don't (and don't regress in that area).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200806200322.1568103-1-nyh@scylladb.com>
* tools/java aa7898d771...f2c7cf8d8d (2):
> dist: debian: support non-x86
> NodeProbe: get all histogram values in a single call
* tools/jmx 626fd75...c5ed831 (1):
> dist: debian: support non-x86
"
The current implementation of the B+ tree benefits from using SIMD
instructions in intra-node key search. This set adds this
functionality.
The general idea behind the implementation is to "ask"
the less comparator whether it is the plain "<" and allows
key simplification to do this natural comparison. If it
does, the search key is simplified to int64_t, the node's
array of keys is cast to an array of integers, then both are
fed into the avx-optimized searcher.
The searcher should work on nodes that are not fully filled with
keys. For performance the "unused" keys are set to the int64_t
minimum, the search loop compares them too (!) and adjusts
the result index by the node size. This needs some care in the
maybe_key{} wrapper.
fixes: #186
tests: unit(dev)
"
* 'br-bptree-avx-b' of https://github.com/xemul/scylla:
utils: AVX searcher
bptree: Special intra-node key search when possible
bptree: Add lesses to maybe_key template
token: Restrict TokenCarrier concept with noexcept
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates the storage engine's assumptions and
leads to undefined behavior.
It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:
scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.
Fixes #6486.
Tests:
- added a new dtest to thrift_tests.py which reproduces the problem
Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
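One way to restore the storage engine's invariant is to sort the derived ranges before building the read command; a minimal sketch (modeling a clustering range as an integer pair, names illustrative, not the actual Scylla types):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model of the fix described above: ranges derived from
// SlicePredicate column names must be emitted in clustering order,
// not in the arbitrary order the client sent them.
struct range { int start, end; };

void sort_ranges(std::vector<range>& rs) {
    // Order ranges by their start bound so downstream readers can
    // assume monotonically increasing clustering positions.
    std::sort(rs.begin(), rs.end(),
              [](const range& a, const range& b) { return a.start < b.start; });
}
```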
In case of an initialization failure after
db.get_compaction_manager().enable();
but before stop_database, we would never stop the compaction manager
and it would assert during destruction.
I am trying to add a test for this using the memory failure injector,
but that will require fixing other crashes first.
Found while debugging #6831.
Refs #6831.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200805181840.196064-1-espindola@scylladb.com>
With all the preparations made so far it's now possible to implement
the avx-powered search in an array.
The array to search in has both -- capacity and size, so searching in
it needs to take allocated, but unused tail into account. Two options
for that -- limit the number of comparisons "by hands" or keep minimal
and impossible value in this tail, scan "capacity" elements, then
correct the result with "size" value. The latter approach is up to 50%
faster than any (tried) attempt to do the former one.
The run-time selection of the array search code is done with the gnu
target attribute. It's available since gcc 4.8. For AVX-less platforms
the default linear scanner is used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
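The sentinel trick described above can be sketched in scalar form (illustrative names, not the actual Scylla code; the real searcher replaces the loop with AVX compares):

```cpp
#include <cstdint>

// Scalar model of the sentinel-padded node search: unused key slots hold
// INT64_MIN, the loop scans the full capacity unconditionally (no
// data-dependent branching, which is what makes a SIMD version easy),
// and the result is corrected afterwards by the node size.
constexpr int node_capacity = 8;

struct node {
    int64_t keys[node_capacity]; // used prefix is sorted; tail = INT64_MIN
    int size;                    // number of keys actually in use
};

// Index of the first used key >= k (lower bound), assuming k > INT64_MIN.
int search(const node& n, int64_t k) {
    int less = 0;
    for (int i = 0; i < node_capacity; i++) {
        less += n.keys[i] < k; // sentinels in the tail count too (!)
    }
    // Each of the (capacity - size) sentinels spuriously counted as "< k",
    // so subtract them out.
    return less - (node_capacity - n.size);
}
```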
If the key type is int64_t and the less-comparator is "natural" (i.e. it's
literally 'a < b') we may use the SIMD instructions to search for the key
on a node. Before doing so, the maybe_key and the searcher should be prepared
for that, in particular:
1. maybe_key should set unused keys to the minimal value
2. the searcher for this case should call the gt() helper with
primitive types -- int64_t search key and array of int64_t values
To tell the B+ code that the key-less pair is such, the less-er should define
the simplify_key() method converting search keys to int64_t-s.
This searcher is selected automatically; if any mismatch happens it silently
falls back to the default one. Thus also add a static assertion to the row-cache
to mitigate this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The way maybe_key works will be in-sync with the intra-node searching
code and will require knowing what the Less type is, so prepare for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The <search-key>.token() is currently noexcept and will have
to be explicitly such for the future optimized key searcher, so
restrict the constraint and update the related classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The single-node test Scylla run by test/alternator/run uses, as the
default, 256 vnodes. When we have 256 vnodes and two shards, our CDC
implementation produces 512 separate "streams" (called "shards" in
DynamoDB lingo). This causes each of the tests in test_streams.py
which need to read data from the stream to need to do 1024 (!) API
requests (512 calls to GetShardIterator and 512 calls to GetRecords)
which takes significant time - about a second per test.
In this patch, we reduce the number of vnodes to 16. We still have
a non-negligible number of stream "shards" (32) so this part of the
CDC code is still exercised. Moreover, to ensure we still routinely
test the paging feature of DescribeStream (whose default page size
is 100), the patch changes the request to use a Limit of 10, so
paging will still be used to retrieve the list of 32 shards.
The time to run the 27 tests in test_streams.py, on my laptop:
Before this patch: 26 seconds
After this patch: 6 seconds.
Fixes #6979
Message-Id: <20200805093418.1490305-1-nyh@scylladb.com>
This patch adds additional tests for Alternator Streams, which helped
uncover 9 new issues.
The stream tests are noticeably slower than most other Alternator tests -
test_streams.py now has 27 tests taking a total of 20 seconds. Much of this
slowness is attributed to Alternator Stream's 512 "shards" per stream in the
single-node test setup with 256 vnodes, meaning that we need over 1000 API
requests per test using GetRecords. These tests could be made significantly
faster (as little as 4 seconds) by setting a lower number of vnodes.
Issue #6979 is about doing this in the future.
The tests in this patch have comments explaining clearly (I hope) what they
test, and also pointing to issues I opened about the problems discovered
through these tests. In particular, the tests reproduce the following bugs:
Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942
The tests also work around the following issues (and can be changed to
be more strict and reproduce these issues):
Refs #6918
Refs #6931
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
"
This patch series converts a few more global variables from sstring to
constexpr std::string_view.
Doing that makes it impossible for them to be part of any
initialization order problems.
"
* 'espindola/more-constexpr-v2' of https://github.com/espindola/scylla:
auth: Turn DEFAULT_USER_NAME into a std::string_view variable
auth: Turn SALTED_HASH into a std::string_view variable
auth: Turn meta::role_members_table::qualified_name into a std::string_view variable
auth: Turn meta::roles_table::qualified_name into a std::string_view variable
auth: Turn password_authenticator_name into a std::string_view variable
auth: Inline default_authorizer_name into only use
auth: Turn allow_all_authorizer_name into a std::string_view variable
auth: Turn allow_all_authenticator_name into a std::string_view variable
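The pattern behind this series can be sketched as follows (the variable name and value are illustrative):

```cpp
#include <string_view>

// A constexpr std::string_view is initialized at compile time, so it has
// no dynamic initializer and cannot take part in static initialization
// order problems the way a global sstring (which allocates in its
// constructor) could.
constexpr std::string_view default_user_name = "cassandra";

static_assert(default_user_name.size() == 9);
```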
If initialization of a TLS variable fails there is nothing better to
do than call std::unexpected.
This also adds a disable_failure_guard to avoid errors when using
allocation error injection.
With init() being noexcept, we can also mark clear_functions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200804180550.96150-1-espindola@scylladb.com>
There is no constexpr operator+ for std::string_view, so we have to
concatenate the strings ourselves.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
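One way to do the compile-time concatenation (a hedged sketch; the helper name and the example strings are illustrative, not the actual Scylla code):

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// std::string_view has no constexpr operator+, so build the joined string
// into a constexpr std::array of chars instead.
template <std::size_t N, std::size_t M>
constexpr auto concat(const char (&a)[N], const char (&b)[M]) {
    std::array<char, N + M - 1> out{};
    for (std::size_t i = 0; i < N - 1; i++) out[i] = a[i];
    for (std::size_t i = 0; i < M; i++) out[N - 1 + i] = b[i]; // copies '\0'
    return out;
}

constexpr auto qualified = concat("system_auth", ".roles");
static_assert(std::string_view(qualified.data()) == "system_auth.roles");
```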
Merged pull request https://github.com/scylladb/scylla/pull/6910
by Wojciech Mitros:
This patch enables selecting more than 2^32 rows from a table. The change
becomes active after upgrading whole cluster - until then old limits are
used.
Tested reading 4.5*10^9 rows from a virtual table, manually upgrading a
cluster with ccm and performing cql SELECT queries during the upgrade,
ran unit tests in dev mode and cql and paging dtests.
tests: add large paging state tests
increase the maximum size of query results to 2^64
Add a unit test checking that the top 32 bits of the number of remaining rows in the paging state
are used correctly, a manual test checking that it's possible to select over 2^32 rows from a table,
and a virtual reader for this table.
This commit takes out some responsibilities of `cdc::transformer`
(which is currently a big ball of mud) into a separate class.
This class is a simple abstraction for creating entries in a CDC log
mutation.
Low-level calls to the mutation API (such as `set_cell`) inside
`cdc::transformer` were replaced by higher-level calls to the builder
abstraction, removing some duplication of logic.
util/loading_cache.hh includes adjusted.
* seastar 02ad74fa7d...eb452a22a0 (17):
> core: add missing include for std::allocator_traits
> exceptions: move timed_out_error and factory into its own header file
> future: parallel_for_each: add disable_failure_guard for parallel_for_each_state
> Merge "Improve file API noexcept correctness" from Rafael
> util: Add a with_allocation_failures helper
> future: Fix indentation
> future: Refactor duplicated try/catch
> future: Make set_to_current_exception public
> future: Add noexcept to continuation related functions
> core: mark timer cancellation functions as noexcept
> future: Simplify future::schedule
> test: add a case for overwriting exact routes
> http: throw on duplicated routes to prevent memory leaks
> metrics: Remove the type label
> fstream: turn file_data_source_impl's memory corruption bugs into aborts
> doc: update tutorial splitting script
> reactor_backend: let the reactor know again if any work was done by aio backend
Merged pull request https://github.com/scylladb/scylla/pull/6969 by
Calle Wilund:
Fixes #6942
Fixes #6926
Fixes #6933
We use clustering [lo:hi) range for iterator query.
To avoid encoding inclusive/exclusive range (depending on
init/last get_records call), instead just increment
the timeuuid threshold.
Also, dynamo result always contains a "records" entry. Include one for us as well.
Also, if the old (or new) image for a change set is empty, dynamo will not include
this key at all. Alternator did return an empty object. This changes it to be excluded
on empty.
alternator::streams: Don't include empty new/old image
alternator::streams: Always include "Records" array in get_records response
alternator::streams: Incr shard iterator threshold in get_records
"
With this patches a monitor is destroyed before the writer, which
simplifies the writer destructor.
"
* 'espindola/simplify-write-monitor-v2' of https://github.com/espindola/scylla:
sstables: Delete write_failed
sstables: Move monitor after writer in compaction_writer
Fixes #6933
If the old (or new) image for a change set is empty, dynamo will not
include this key at all. Alternator did return an empty object.
This changes it to be excluded on empty.
Fixes #6942
We use clustering [lo:hi) range for iterator query.
To avoid encoding inclusive/exclusive range (depending on
init/last get_records call), instead just increment
the timeuuid threshold.
Fixes #6866
If we try to create/alter an Alternator table to include streams,
we must check that the cluster does in fact support CDC
(experimental still). If not, throw a hopefully somewhat descriptive
error.
(Normal CQL table create goes through a similar check in cql_prop_defs)
Note: no other operations are prohibited. The cluster could have had CDC
enabled before, so streams could exist to list and even read.
Any tables loaded from schema tables should be responsible for their
own validation.
Refs #6864
When booting a clean scylla, CDC stream IDs will not be available until
an n*ring delay time period has passed. Before this, writing to a CDC
enabled table will fail hard.
For alternator (and its tests), we can report the stream(s) for tables as not yet
available (ENABLING) until such time as the IDs are
computed.
v2:
* Keep storage service ref in executor
"
This is inspired by #6781. The idea is to make Scylla listen for CQL connections on port 9042 (where both old shard-aware and shard-unaware clients can still connect the traditional way). On top of that I added a new port, where everything works the same way, only the port from the client's socket is used to determine the shard No. to connect to. The desired shard No. is the result of `clientside_port % num_shards`.
The new port is configurable from scylla.yaml and defaults to 19042 (unencrypted, unless user configures encryption options and omits `native_shard_aware_transport_port_ssl` in DB config).
Two "SUPPORTED" tags are added: "SCYLLA_SHARD_AWARE_PORT" and "SCYLLA_SHARD_AWARE_PORT_SSL". For compatibility, "SCYLLA_SHARDING_ALGORITHM" is still kept.
Fixes #5239
"
* jul-stas-shard-aware-listener:
docs: Info about shard-aware listeners in protocol-extensions
transport: Added listener with port-based load balancing
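The load-balancing rule quoted above is a one-liner; a sketch (function name is illustrative):

```cpp
#include <cstdint>

// Desired shard for a connection arriving on the shard-aware port: derived
// purely from the client's source port, as described above.
unsigned shard_for_connection(uint16_t client_port, unsigned num_shards) {
    return client_port % num_shards;
}
```

A client that wants shard N simply binds a local port whose value modulo the shard count equals N before connecting.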
Currently, we cannot select more than 2^32 rows from a table because we are limited by types of
variables containing the numbers of rows. This patch changes these types and sets new limits.
The new limits take effect while selecting all rows from a table - custom limits of rows in a result
stay the same (2^32-1).
In classes which are being serialized and used in messaging, in order to be able to process queries
originating from older nodes, the top 32 bits of new integers are optional and stay at the end
of the class - if they're absent we assume they equal 0.
The backward compatibility was tested by querying an older node for a paged selection, using the
received paging_state with the same select statement on an upgraded node, and comparing the returned
rows with the result generated for the same query by the older node, additionally checking if the
paging_state returned by the upgraded node contained new fields with correct values. Also verified
if the older node simply ignores the top 32 bits of the remaining rows number when handling a query
with a paging_state originating from an upgraded node by generating and sending such a query to
an older node and checking the paging_state in the reply(using python driver).
Fixes #5101.
Fixes #6341
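The wire-compatibility scheme described above (optional trailing top 32 bits, treated as zero when absent) can be sketched like this; the function name is illustrative, not the actual serializer:

```cpp
#include <cstdint>
#include <optional>

// Combine the legacy low 32 bits of a row count with the optional new
// high 32 bits appended at the end of the serialized class. A message
// from an older node simply lacks the field, so it defaults to 0.
uint64_t combine_rows(uint32_t low, std::optional<uint32_t> high) {
    return (uint64_t(high.value_or(0)) << 32) | low;
}
```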
Since scylla no longer supports upgrading from a version without the
"new" (dedicated) truncation record table, we can remove support for these
and the migration thereof.
Make sure the above holds wherever this is committed.
Note that this does not remove the "truncated_at" field in
system.local.
Merged pull request https://github.com/scylladb/scylla/pull/6914
by By Juliusz Stasiewicz:
The goal is to have finer control over CDC "delta" rows, i.e.:
disable them totally (mode off);
record only base PK+CK columns (mode keys);
make them behave as usual (mode full, default).
The editing of log rows is performed at the stage of finishing CDC mutation.
Fixes#6838
tests: Added CQL test for `delta mode`
cdc: Implementations of `delta_mode::off/keys`
cdc: Infrastructure for controlling `delta_mode`
The rule for THE REST results in each person listed in it
to receive notifications about every single pull request,
which can easily lead to inbox overload - the generic
rule is therefore dropped and authors of pull requests
are expected to manually add reviewers. GitHub offers
semi-random suggestions for reviewers anyway.
Message-Id: <3c0f7a2f13c098438a8abf998ec56b74db87c733.1596450426.git.sarna@scylladb.com>
This reverts commit b97f466438.
It turns out that the schema mechanism has a lot of nuances.
After this change, for an unknown reason, it was empirically
shown that the amount of cross-shard traffic on an upgraded node
increased significantly under a steady stress load; it
was so significant that the node appeared unavailable to
the coordinators because all of the requests started to fail
on the smp_service_group semaphore.
This revert brings back a caveat in Scylla: creating a table
in a mixed cluster **might** under certain conditions cause a
schema mismatch on the newly created table, which makes the
table essentially unusable until the whole cluster has
a uniform version (rolling upgrade or rollback completion).
Fixes #6893.
We need only one of the shards owning each sstable to call create_links.
This will allow us to simplify it and only handle crash/replay scenarios rather than rename/link/remove races.
Fixes #1622
Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803065505.42100-3-bhalevy@scylladb.com>
The "outdir" variable in configure.py and "$builddir" in build.ninja
file specifies the build directory. Let's use them to eliminate
hard-coded "build" paths from configure.py.
Message-Id: <20200731105113.388073-1-penberg@scylladb.com>
Alternator Streams have a "alternator_streams_time_window_s" parameter which
is used to allow for correct ordering in the stream in the face of clock
differences between Scylla nodes and possibly network delays. This parameter
currently defaults to 10 seconds, and there is a discussion on issue #6929
on whether it is perhaps too high. But in any case, for tests running on a
single node there is no reason not to set this parameter to zero.
Setting this parameter to zero greatly speeds up the Alternator Streams
tests which use ReadRecords to read from the stream. Previously each such
test took at least 10 seconds, because the data was only readable after a
10 second delay. With alternator_streams_time_window_s=0, these tests can
finish in less than a second. Unfortunately they are still relatively slow
because our Streams implementation has 512 shards, and thus we need over a
thousand (!) API calls to read from the stream.
Running "test/alternator/run test_streams.py" with 25 tests took before
this patch 114 seconds, after this patch, it is down to 18 seconds.
Refs #6929
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200728184612.1253178-1-nyh@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
"
To mount RAID volume correctly (#6876), we need to wait for MDRAID initialization.
To do so we need to add After=mdmonitor.service to var-lib-scylla.mount.
Also, `lsblk -n -oPARTTYPE {dev}` does not work for CentOS7, since the older lsblk does not support the PARTTYPE column (#6954).
We need to provide a relocatable lsblk and run it in the out() / run() functions instead of the distribution-provided version.
"
* syuu1228-scylla_raid_setup_mount_correctly_beyond_reboot:
scylla_raid_setup: initialize MDRAID before mounting data volume
create-relocatable-package.py: add lsblk for relocatable CLI tools
scylla_util.py: always use relocatable CLI tools
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729184901.205646-1-espindola@scylladb.com>
The row_cache::_partitions type is nowadays a double_decker, which is a B+tree of
intrusive_arrays of cache_entrys, so the scylla cache command will raise an error,
being unable to parse this new data type.
The respective iterator for double decker starts on the tree and walks the list
of leaf nodes, on each node it walks the plain array of data nodes, then on each
data node it walks the intrusive array of cache_entrys yielding them to the
caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200730145851.8819-1-xemul@scylladb.com>
The new port is configurable from scylla.yaml and defaults to 19042
(unencrypted, unless client configures encryption options and omits
`native_shard_aware_transport_port_ssl`).
Two "SUPPORTED" tags are added: "SCYLLA_SHARD_AWARE_PORT" and
"SCYLLA_SHARD_AWARE_PORT_SSL". For compatibility,
"SCYLLA_SHARDING_ALGORITHM" is still kept.
Fixes#5239
var-lib-scylla.mount should wait for MDRAID initialization, so we need to add
'After=mdmonitor.service'.
However, currently mdmonitor.service fails to start because no mail address
is specified; we need to add the entry to mdadm.conf.
Fixes #6876
On some CLI tools, command options may differ between the latest version
and older versions.
To maximize compatibility of the setup scripts, we should always use
the relocatable CLI tools instead of the distribution version of the tool.
Related #6954
The .gitignore entry for testlog/ directory is generalized
from "testlog/*" to "testlog", in order to please everyone
who potentially wants test logs to use ramfs by symlinking
testlog to /tmp. Without the change, the symlink remains
visible in `git status`.
Message-Id: <e600f5954868aea7031beb02b1d8e12a2ff869e2.1596115302.git.sarna@scylladb.com>
Replace the MAINTAINERS file with a CODEOWNERS file, which Github is
able to parse, and suggest reviewers for pull requests.
* penberg-penberg/codeowners:
Replace MAINTAINERS with CODEOWNERS
Update MAINTAINERS
The tests in this patch, which pass on DynamoDB but fail on Alternator,
reproduce a bug described in issue #6951. This bug makes it impossible for
a Query (or Scan) to filter on an attribute if that attribute is not
requested to be included in the output.
This patch includes two xfailing tests of this type: One testing a
combination of FilterExpression and ProjectionExpression, and the second
testing a combination of QueryFilter and AttributesToGet; These two
pairs are, respectively, DynamoDB's newer and older syntaxes to achieve
the same thing.
Additionally, we add two xfailing tests that demonstrate that combining
old and new style syntax (e.g., FilterExpression with AttributesToGet)
should not have been allowed (DynamoDB doesn't allow such combinations),
but Alternator currently accepts these combinations.
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200729210346.1308461-1-nyh@scylladb.com>
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.
This will typically manifest as a segfault.
Fixes #6953
Introduced in 79935df
Tests:
- manual using scylla binary. Reproduced the problem then verified the fix makes it go away
Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
The test/alternator/run script follows the pytest log with a full log of
Scylla. This saved log can be useful in diagnosing problems, but most of
it is filled with non-useful "INFO"-level messages. The two biggest
offenders are compaction - which logs every single compaction happening,
and the migration manager, which logs a second (and very long) message
about schema change operations (e.g., table creations). Neither of these
are interesting for Alternator's tests, which shouldn't care exactly when
compaction of which sstable is happening. These two components alone
are responsible for 80% of the log lines, and 90% of the log bytes!
In this patch we increase the log level of just these two components -
compaction and migration_manager - to WARN, which reduces the log
by the same percentages (80% by lines, 90% by bytes).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200728191420.1254961-1-nyh@scylladb.com>
On CentOS7, systemd does not support percentage-based parameters.
To apply the memory parameter on CentOS7, we need to override it
in bytes instead of a percentage.
Fixes #6783
"
Non-paged queries completely ignore the query result size limiter
mechanism. They consume all the memory they want. With sufficiently
large datasets this can easily lead to a handful or even a single
unpaged query producing an OOM.
This series continues the work started by 134d5a5f7, by introducing a
configurable pair of soft/hard limit (default to 1MB/100MB) that is
applied to otherwise unlimited queries, like reverse and unpaged ones.
When an unlimited query reaches the soft limit a warning is logged. This
should give users some heads-up to adjust their application. When the
hard limit is reached the query is aborted. The idea is to not greet
users with failing queries after an upgrade while at the same time
protect the database from the really bad queries. The hard limit should
be decreased from time to time gradually approaching the desired goal of
1MB.
We don't want to limit internal queries, we trust ourselves to either
use another form of memory usage control, or read only small datasets.
So the limit is selected according to the query class. User reads use
the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration
items, while internal reads are not limited. The limit is obtained by
the coordinator, who passes it down to replicas using the existing
`max_result_size` parameter (which is now a special type containing the
two limits), which is now passed on every verb, instead of once per
connection. This ensures that all replicas work with the same limits.
For normal paged queries `max_result_size` is set to the usual
`query::result_memory_limiter::maximum_result_size`. For queries that can
consume an unlimited amount of memory -- unpaged and reverse queries --
this is set to the value of the aforementioned
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration item,
but only for user reads, internal reads are not limited.
This has the side-effect that reverse reads now send entire
partitions in a single page, but this is not that bad. The data was
already read, and its size was below the limit, the replica might as well
send it all.
Fixes: #5870
"
* 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits)
test: database_test: add test for enforced max result limit
mutation_partition: abort read when hard limit is exceeded for non-paged reads
query-result.hh: move the definition of short_read to the top
test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit
test: set the allow_short_read slice option for paged queries
partition_slice_builder: add with_option()
result_memory_accounter: remove default constructor
query_*(): use the coordinator specified memory limit for unlimited queries
storage_proxy: use read_command::max_result_size to pass max result size around
query: result_memory_limiter: use the new max_result_size type
query: read_command: add max_result_size
query: read_command: use tagged ints for limit ctor params
query: read_command: add separate convenience constructor
service: query_pager: set the allow_short_read flag
result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit
storage_proxy: add get_max_result_size()
result_memory_limiter: add unlimited_result_size constant
database: add get_statement_scheduling_group()
database: query_mutations(): obtain the memory accounter inside
query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field
...
* tools/java a9480f3a87...aa7898d771 (4):
> dist: debian: do not require root during package build
> cassandra-stress: Add serial consistency options
> dist: debian: fix detection of debuild
> bin tools: Use non-default `cassandra.config`
* tools/jmx c0d9d0f...626fd75 (1):
> dist: debian: do not require root during package build
Fixes #6655.
If the read is not paged (short read is not allowed) abort the query if
the hard memory limit is reached. On reaching the soft memory limit a
warning is logged. This should allow users to adjust their application
code while at the same time protecting the database from the really bad
queries.
The enforcement happens inside the memory accounter and doesn't require
cooperation from the result builders. This ensures memory limit set for
the query is respected for all kinds of reads. Previously non-paged reads
simply ignored the memory accounter requesting the read to stop and
consumed all the memory they wanted.
This is a 4.2% reduction in the scylla text size, from 38975956 to
37404404 bytes.
When benchmarking perf_simple_query without --shuffle-sections, there
is no performance difference.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200724032504.3004-1-espindola@scylladb.com>
Some tests use the lower-level methods directly and were meant to use
paging but didn't, and nobody noticed. This was revealed by the enforcement of
max result size (introduced in a later patch), which caused these tests
to fail due to exceeding the max result size.
This patch fixes this by setting the `allow_short_read` slice option.
If somebody wants to bypass proper memory accounting they should at
the very least be forced to consider if that is indeed wise and think a
second about the limit they want to apply.
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.
To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
only interested in the semaphore anyway, and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
This field will replace max size which is currently passed once per
established rpc connection via the CLIENT_ID verb and stored as an
auxiliary value on the client_info. For now it is unused, but we update
all sites creating a read command to pass the correct value to it. In the
next patch we will phase out the old max size and use this field to pass
max size on each verb instead.
The convenience constructor of read_command now has two integer
parameters next to each other. In the next patch we intend to add another
one. This is a recipe for disaster, so to avoid mistakes this patch
converts these parameters to tagged integers. This makes sure callers
pass what they meant to pass. As a matter of fact, while fixing up
call-sites, I already found several ones passing `query::max_partitions`
to the `row_limit` parameter. No harm done yet, as
`query::max_partitions` == `query::max_rows` but this shows just how
easy it is to mix up parameters with the same type.
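The tagged-integer idea described above can be sketched as follows (a minimal illustration, not Scylla's actual types; `row_limit`, `partition_limit` and `read_command_sketch` are hypothetical names):

```cpp
#include <cassert>
#include <cstdint>

// Each limit gets its own distinct wrapper type, so passing a partition
// limit where a row limit is expected no longer compiles.
template <typename Tag>
struct tagged_uint {
    uint64_t value;
    explicit tagged_uint(uint64_t v) : value(v) {}
};

struct row_limit_tag {};
struct partition_limit_tag {};

using row_limit = tagged_uint<row_limit_tag>;
using partition_limit = tagged_uint<partition_limit_tag>;

// A constructor taking both limits: callers must spell out which is which.
struct read_command_sketch {
    uint64_t rows;
    uint64_t partitions;
    read_command_sketch(row_limit r, partition_limit p)
        : rows(r.value), partitions(p.value) {}
};
```

With this in place, `read_command_sketch(partition_limit(10), row_limit(100))` is a compile error rather than a silent parameter swap.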
query::read_command currently has a single constructor, which serves
both as an idl constructor (order of parameters is fixed) and a convenience one
(most parameters have default values). This makes it very error prone to
add new parameters that everyone should fill. A new parameter has to
be added last, with a default value, since the previous ones have
default values as well. This means the compiler's help cannot be enlisted
to make sure all usages are updated.
This patch adds a separate convenience constructor to be used by normal
code. The idl constructor loses all default parameters. New parameters
can be added to any position in the convenience constructor (to force
users to fill in a meaningful value) while the removed default
parameters from the idl constructor means code cannot accidentally use
it without noticing.
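The split described above can be sketched like this (an illustrative stand-in, not the real `query::read_command`; the idl constructor keeps a fixed order with no defaults, while a convenience path fills defaults for normal code):

```cpp
#include <cassert>
#include <cstdint>

struct command_sketch {
    uint64_t row_limit;
    uint64_t partition_limit;
    uint64_t max_result_size;

    // idl-style constructor: fixed parameter order, no default values.
    // Adding a field here breaks every caller at compile time, which is
    // exactly the point.
    command_sketch(uint64_t rl, uint64_t pl, uint64_t mrs)
        : row_limit(rl), partition_limit(pl), max_result_size(mrs) {}

    // Convenience path for normal code; new parameters can be inserted
    // anywhere here to force callers to supply a meaningful value.
    static command_sketch with_defaults(uint64_t rl) {
        return command_sketch(rl, /*pl=*/UINT64_MAX, /*mrs=*/UINT64_MAX);
    }
};
```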
All callers should set this already before passing the slice to the
pager, however not all actually do (e.g.
`cql3::indexed_table_select_statement::read_posting_list()`). Instead of
auditing each call site, just make sure this is set in the pager
itself. If someone is creating a pager we can be sure they mean to use
paging.
The use of the global `result_memory_limiter::maximum_result_size` is
probably a leftover from before the `_maximum_result_size` member was
introduced (aa083d3d85).
Meant to be used by the coordinator node to obtain the max result size
applicable to the query-class (determined based on the current
scheduling group). For normal paged queries the previously used
`query::result_memory_limiter::maximum_result_size` is used uniformly.
For reverse and unpaged queries, a query class dependent value is used.
For user reads, the value of the
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration items
is used, for other classes no limit is used
(`query::result_memory_limiter::unlimited_result_size`).
We want to switch from using a single limit to a dual soft/hard limit.
As a first step we switch the limit field of `query_class_config` to use
the recently introduced type for this. As this field has a single user
at the moment -- reverse queries (and not a lot of propagation) -- we
update it in this same patch to use the soft/hard limit: warn on
reaching the soft limit and abort on the hard limit (the previous
behaviour).
This pair of limits replaces the old max_memory_for_unlimited_query one,
which remains as an alias to the hard limit. The soft limit inherits the
previous value of the limit (1MB), when this limit is reached a warning
will be logged allowing the users to adjust their client codes without
downtime. The hard limit starts out with a more permissive default of
100MB. When this is reached queries are aborted, the same behaviour as
with the previous single limit.
The idea is to allow clients a grace period for fixing their code, while
at the same time protecting the database from the really bad queries.
Now that there are no ad-hoc aliases needing to overwrite the name and
description parameters of this method, we can drop those parameters and
have each config item just use `name()` and `desc()`.
We already use aliases for some configuration items, although these are
created with an ad-hoc mechanism that only registers them on the command
line. Replace this with the built-in alias mechanism introduced in the
previous patch, which has the benefit of conflict resolution and also
works with YAML.
Allow configuration items to also have an alias, besides the name.
This allows easy replacement of configuration items, with newer names,
while still supporting the old name for backward compatibility.
The alias mechanism takes care of registering both the name and the
alias as command line arguments, as well as parsing them from YAML.
The command line documentation of the alias will just refer to the name
for documentation.
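The alias mechanism can be sketched with a pair of maps (a minimal illustration, not Scylla's config framework; all names are hypothetical): both the canonical name and the alias resolve to the same underlying item, so old names keep working.

```cpp
#include <cassert>
#include <map>
#include <string>

struct config_sketch {
    std::map<std::string, std::string> storage;       // canonical name -> value
    std::map<std::string, std::string> alias_to_name; // alias -> canonical name

    void declare(const std::string& name, const std::string& alias) {
        alias_to_name[alias] = name;
    }
    // Accepts either the canonical name or the alias.
    void set(const std::string& key, const std::string& value) {
        auto it = alias_to_name.find(key);
        storage[it != alias_to_name.end() ? it->second : key] = value;
    }
    std::string get(const std::string& name) { return storage[name]; }
};
```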
This procedure consists of trimming SSTables off a compaction job until its weight[1]
is smaller than one already taken by a running compaction. Min threshold is respected
though, we only trim a job while its size is > min threshold.
[1]: this value is a logarithmic function of the total size of the SSTables in a
given job, and it's used to control the compaction parallelism.
It's intended to improve the compaction efficiency by allowing more jobs to run in
parallel, but it turns out that this can have an opposite effect because the write
amplification can be significantly increased.
Take STCS for example, the more similar-sized SSTables you compact together, the
higher the compaction efficiency will be. With the trimming procedure, we're aiming
at running smaller jobs, thinking that running more parallel compactions will provide
us with better performance, but that's not true. Most of the efficiency comes from
making informed decisions when selecting candidates for compaction.
Similarly, this will also hurt TWCS, which does STCS in current window, and a sort
of major compaction when the current window closes. If the TWCS jobs are trimmed,
we'll likely need another compaction to get to the desired state, recompacting
the same data again.
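A logarithmic weight of the kind described in [1] could look like the following (a hypothetical illustration; the exact function Scylla uses may differ): jobs whose total SSTable size falls in the same power-of-two bucket get the same weight, which is then used to limit parallelism between similar-sized compactions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bucket a job's total size into its power-of-two order of magnitude.
inline int job_weight(uint64_t total_bytes) {
    if (total_bytes == 0) {
        return 0;
    }
    return static_cast<int>(std::log2(static_cast<double>(total_bytes)));
}
```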
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200728143648.31349-1-raphaelsc@scylladb.com>
On GCE, /dev/sda14 is reported as an unused disk, but it is the BIOS boot
partition: it should not be used for a Scylla data partition, and it is
also too small to be usable anyway.
It's better to exclude such partitions from the unused disk list.
Fixes #6636
We saw scylla hit a use-after-free in repair with the following procedure during tests:
- n1 and n2 in the cluster
- n2 ran decommission
- n2 sent data to n1 using repair
- n2 was killed forcefully
- n1 tried to remove the repair_meta for n2
- n1 hit a use-after-free on the repair_meta object
This was what happened on n1:
1) data was received -> do_apply_rows was called -> yield before create_writer() was called
2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
with _writer_done[node_idx] not engaged
3) step 1 resumed, create_writer() was called and _repair_writer object was referenced
4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed
5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object
To fix, we should call wait_for_writer_done() only after any pending
operations protected by repair_meta::_gate have completed. This
prevents wait_for_writer_done() from finishing while the writer is
still in the process of being created.
Fixes: #6853
Fixes: #6868
Backports: 4.0, 4.1, 4.2
The log "removing pending replacing node" is printed whenever a node
jumps to normal status including a normal restart. For example, on
node1, we saw the following when node2 restarts.
[shard 0] storage_service - Node 127.0.0.2 state jump to normal
[shard 0] storage_service - Remove node 127.0.0.2 from pending replacing endpoint
This is confusing since no node is really being replaced.
To fix, log only if a node is really removed from the pending replacing
nodes.
In addition, since do_remove_node will call del_replacing_endpoint,
there is no need to call del_replacing_endpoint again in
storage_service::handle_state_normal after do_remove_node.
Fixes #6936
Currently, encountering an error when loading view build progress
would result in view builder refusing to start - which also means
that future views would not be built until the server restarts.
A more user-friendly solution would be to log an error message,
but continue to boot the view builder as if no views are currently
in progress, which would at least allow future views to be built
correctly.
The test case is also amended, since now it expects the call
to return that "no view builds are in progress" instead of
an exception.
Fixes #6934
Tests: unit(dev)
Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>
In untyped_result_set::get_view, there exists a silent assumption
that the underlying data, which is an optional, is always engaged.
In case the value happens to be disengaged it may lead to creating
an incorrect bytes view from a disengaged optional.
In order to make the code safer (since values parsed by this code
often come from the network and can contain virtually anything)
a segfault is replaced with an exception, by calling optional's
value() function, which throws when called on disengaged optionals.
Fixes #6915
Tests: unit(dev)
Message-Id: <6e9e4ca67e0e17c17b718ab454c3130c867684e2.1595834092.git.sarna@scylladb.com>
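The distinction the patch relies on can be shown in a few lines (`safe_view` is an illustrative stand-in for the real accessor): dereferencing a disengaged `std::optional` with `operator*` is undefined behaviour, while `value()` throws `std::bad_optional_access`, which can be handled.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Throws std::bad_optional_access instead of segfaulting when the
// underlying optional is disengaged.
std::string safe_view(const std::optional<std::string>& data) {
    return data.value();
}
```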
It turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to the context, but it misses
all the input streams that can be opened in the consumer for promoted index
reading. The consumer stores a list of indexes, each of which has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes, as that could cause
memory corruption due to read-ahead.
Fixes #6924.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
With this the monitor is destroyed first. It makes intuitive sense to
me to destroy a monitor_X before X. This is also the order we had
before 55a8b6e3c9.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The goal is to have finer control over CDC "delta" rows, i.e.:
- disable them totally (mode `off`);
- record only PK+CK (mode `keys`);
- make them behave as usual (mode `full`, default).
This commit adds the necessary infrastructure to `cdc_options`.
In commit 8d27e1b, we added tracing (see docs/tracing.md) support to
Alternator requests. However, we never had a functional test that
verifies this feature actually works as expected, and we recently
noticed that for the GetItem and BatchGetItem requests, the trace
doesn't really work (it returns an empty list of events).
So this patch adds a test, test/alternator/test_tracing.py, which verifies
that the tracing feature works for the PutItem, GetItem, DeleteItem,
UpdateItem, BatchGetItem, BatchWriteItem, Query and Scan operations.
This test is very peculiar. It needs to use out-of-band REST API
requests to enable and disable tracing (of course, the test is skipped
when running against AWS - this is a Scylla-only feature). It also needs
to read CQL-only system tables and does this using Alternator's
".scylla.alternator" interface for system tables - which came through
for us here beautifully and demonstrated their usefulness.
I paid a lot of attention to keeping this test reasonably fast -
this entire test now runs in a little less than one second. Achieving
this while testing eight different requests was a bit of a challenge,
because traces take time until they are visible in the trace table.
This is the main reason why in this patch the tests for all eight
request types are done in one test, instead of eight separate tests.
Fixes #6891
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727115401.1199024-1-nyh@scylladb.com>
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
I manually checked that each individual defense added is already
preventing the crash from happening.
Fixes: #6613
Fixes: #6907
Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
Issue #6919 was caused by an incorrect assumption: I *assumed* that if we see
the tracing session record, we can be sure that the event records for this
session have already been written. In this patch we add a paragraph to
the tracing documentation - docs/tracing.md, which explains that this
assumption is in fact incorrect:
1. On a multi-node setup, replicas may continue to write tracing events
after the coordinator "finished" (moved to background) the request
and wrote the session record.
2. Even on a single-node setup, the writes of the session record and the
individual events are asynchronous, and can happen in an unexpected
order (which is what happened in issue #6919).
Refs #6919.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727102438.1194314-1-nyh@scylladb.com>
On 2d63acdd6a we replaced 'ol' and 'amzn'
with 'oracle' and 'amazon', but distro.id() actually returns 'amzn' for
Amazon Linux 2, so we need to revert the change.
Fixes #6882
Merged patch set by Botond Dénes:
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
test/boost: view_build_test: add test_view_update_generator_buffering
test/boost: view_build_test: add test test_view_update_generator_deadlock
reader_permit: reader_resources: add operator- and operator+
reader_concurrency_semaphore: add initial_resources()
test: cql_test_env: allow overriding database_config
mutation_reader: expose new_reader_base_cost
db/view: view_updating_consumer: allow passing custom update pusher
db/view: view_update_generator: make staging reader evictable
db/view: view_updating_consumer: move implementation from table.cc to view.cc
database: add make_restricted_range_sstable_reader()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
---
db/view/view_updating_consumer.hh | 51 ++++++++++++++++++++++++++++---
db/view/view.cc | 39 +++++++++++++++++------
db/view/view_update_generator.cc | 19 +++++++++---
3 files changed, 91 insertions(+), 18 deletions(-)
In some cases the estimated number of partitions can be 0 which, albeit a
legitimate estimation result, breaks much of the low-level sstable writer
code, so some of it has assertions to ensure the estimated partition
count is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone, as #6913 has shown. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to no longer do it; it is unnecessary now, as the
writer does it itself.
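The clamping itself is a one-liner, sketched here as a free function (the name is illustrative, not the writers' actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// An estimate of 0 is a legitimate estimation result, but the sstable
// writers require at least 1, so clamp before use.
inline uint64_t clamp_estimated_partitions(uint64_t estimated) {
    return std::max<uint64_t>(estimated, 1);
}
```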
Fixes #6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
"
This makes sure that monitors are always owned by the same struct that
owns the monitored writer, simplifying the lifetime management.
This hopefully fixes some of the crashes we have observed around this
area.
"
* 'espindola/use-compaction_writer-v6' of https://github.com/espindola/scylla:
sstables: Rename _writer to _compaction_writer
sstables: Move compaction_write_monitor to compaction_writer
sstables: Add couple of writer() getters to garbage_collected_sstable_writer
sstables: Move compaction_write_monitor earlier in the file
A DELETE statement checks that the deletion range is symmetrically
bounded. This check was broken for expression TRUE.
Test the fix by setting initial_key_restrictions::expression to TRUE,
since CQL doesn't currently allow WHERE TRUE. That change has been
proposed anyway in feedback to #5763:
https://github.com/scylladb/scylla/pull/5763#discussion_r443213343
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
The new verb is used to replace the current gossip shadow round
implementation. Current shadow round implementation reuses the gossip
syn and ack async messages, which has plenty of drawbacks. It is hard to
tell whether a specific peer node has responded to the syn messages. The
delayed responses from the shadow round can apply to the normal gossip
states even after the shadow round is done. The syn and ack message
handlers are full of special cases due to the shadow round. All gossip
application states, including the ones that are not relevant, are sent
back. The gossip application states are applied and the gossip
listeners are called as if in normal gossip operation. It is
completely unnecessary to call the gossip listeners in the shadow round.
This patch introduces a new synchronous verb to request exactly the
gossip application states the shadow round needs, and applies the
application states without calling the gossip listeners. This patch
makes the shadow round easier to reason about, more robust and
efficient.
Refs: #6845
Tests: update_cluster_layout_tests.py
"
`close_on_failure` was committed to seastar so use
the library version.
This requires making the lambda function passed to
it nothrow move constructible, so this series also
makes db::commitlog::descriptor move constructor noexcept
and changes allocate_segment_ex and segment::segment
to get a descriptor by value rather than by reference.
Test: unit(dev), commitlog_test(debug)
"
* tag 'commit-log-use-with_file_close_on_failure-v1' of github.com:bhalevy/scylla:
commitlog: use seastar::with_file_close_on_failure
commitlog: descriptor: make nothrow move constructible
commitlog: allocate_segment_ex, segment: pass descriptor by value
commitlog: allocate_segment_ex: filename capture is unused
In another patch I noticed gcc producing dead functions. I am not sure
why gcc is doing that. Some of those functions are already placed in
independent sections, and so can be garbage collected by the linker.
This is a 1% text section reduction in scylla, from 39363380 to
38974324 bytes. There is no difference in the tps reported by
perf_simple_query.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200723152511.8214-1-espindola@scylladb.com>
In the patch "Add exception overloads for Dynamo types", Alternator's single
api_error exception type was replaced by a more complex hierarchy of types.
The implementation was not only longer and more complex to understand -
I believe it also negated an important observation:
The "api_error" exception type is special. It is not an exception created
by code for other code. It is not meant to be caught in Alternator code.
Instead, it is supposed to contain an error message created for the *user*,
containing one of the few supported exception "names" described
in the DynamoDB documentation, and a user-readable text message. Throwing
such an exception in Alternator code means the thrower wants the request
to abort immediately, and this message to reach the user. These exceptions
are not designed to be caught in Alternator code. Code should use other
exceptions - or alternatives to exceptions (e.g., std::optional) for
problems that should be handled before returning a different error to the
user. Moreover, "api_error" isn't just thrown as an exception - it can
also be returned by value (in an executor::request_return_type) - which is
another reason why it should not be subclassed.
For these reasons, I believe we should have a single api_error type, and
it's wrong to subclass it. So in this patch I am reverting the subclasses
and template added in the aforementioned patch.
Still, one correct observation made in that patch was that it is
inconvenient to type in DynamoDB exception names (no help from the editor
in completing those strings) and also error-prone. In this patch we
propose a different - simpler - solution to the same problem:
We add trivial factory functions, e.g., api_error::validation(std::string)
as a shortcut to api_error("ValidationException"). The new implementation
is easy to understand, and also more self explanatory to readers:
It is now clear that "api_error::validation()" is actually a user-visible
"api_error", something which was obscured by the name validation_exception()
used before this patch.
Finally, this patch also improves the comment in error.hh explaining the
purpose of api_error and the fact it can be returned or thrown. The fact
it should not be subclassed is legislated with a "finally". There is also
no point in this class inheriting from std::exception or having virtual
functions, or an empty constructor - so all these are dropped as well.
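The single-class-plus-factories design argued for above can be sketched as follows (the factory names follow the commit text, e.g. `api_error::validation()`; the member layout is illustrative, not Alternator's actual class):

```cpp
#include <cassert>
#include <string>

struct api_error {
    std::string type;    // DynamoDB error "name", e.g. "ValidationException"
    std::string message; // user-readable text

    api_error(std::string t, std::string msg)
        : type(std::move(t)), message(std::move(msg)) {}

    // Factory functions make call sites self-explanatory and keep the
    // DynamoDB type strings in one place.
    static api_error validation(std::string msg) {
        return api_error("ValidationException", std::move(msg));
    }
    static api_error internal(std::string msg) {
        return api_error("InternalServerError", std::move(msg));
    }
};
```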
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* 'api-error-refactor' of https://github.com/nyh/scylla:
alternator: use api_error factory functions in auth.cc
alternator: use api_error::validation()
alternator: use api_error factory functions in executor.cc
alternator: use api_error factory functions in server.cc
alternator: refactor api_error class
It's better to assert a certain vector size first and only then
dereference its elements - otherwise, if a bug causes the size
to be different, the test can crash with a segfault on an invalid
dereference instead of gracefully failing with a test assertion.
Generating bounds from multi-column restrictions used to create
incorrect nonwrapping intervals, which only happened to work because
they're implemented as wrapping intervals underneath.
The following CQL restriction:
WHERE (a, b) >= (1, 0)
should translate to
(a, b) >= (1, 0), no upper bound,
while it incorrectly translates to
(a, b) >= (1, 0) AND (a, b) < empty-prefix.
Since empty prefix is smaller than any other clustering key,
this range was in fact not correct, since the assumption
was that starting bound was never greater than the ending bound.
While the bug does not trigger any errors in tests right now,
it starts to do so after the code is modified in order to
correctly handle empty intervals (intervals with start > end).
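The correct translation can be sketched with optional bounds (a minimal illustration; `prefix`, `interval_sketch` and `bounds_for_ge` are hypothetical names, with a vector of ints standing in for a clustering-key prefix): the lower bound carries the restriction's tuple and the upper bound is simply absent, instead of being set to an empty prefix that sorts before everything and would invert the interval.

```cpp
#include <cassert>
#include <optional>
#include <vector>

using prefix = std::vector<int>; // stand-in for a clustering-key prefix

struct interval_sketch {
    std::optional<prefix> lower; // inclusive
    std::optional<prefix> upper; // disengaged == unbounded
};

// Bounds for a restriction like (a, b) >= (1, 0): lower bound set, no
// upper bound at all.
inline interval_sketch bounds_for_ge(prefix p) {
    return interval_sketch{std::move(p), std::nullopt};
}
```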
To make sure it belongs to the same semaphore that the database thinks
is appropriate for the current query. Since a semaphore mismatch points
to a serious bug, we use `on_internal_error()` to allow generating
coredumps on-demand.
Currently inactive read handles are only unique within the same
semaphore, allowing for an unregister against another semaphore to
potentially succeed. This can lead to disasters ranging from crashes to
data corruption. While a handle should never be used with another
semaphore in the first place, we have recently seen a bug (#6613)
causing exactly that, so in this patch we prevent such unregister
operations from ever succeeding by making handles unique across all
semaphores. This is achieved by adding a pointer to the semaphore to the
handle.
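The fix can be sketched in a few lines (illustrative types, not the real reader_concurrency_semaphore): the handle remembers which semaphore registered it, so unregistering against any other semaphore fails instead of silently detaching someone else's reader.

```cpp
#include <cassert>
#include <cstdint>

struct semaphore_sketch;

struct inactive_read_handle {
    semaphore_sketch* owner = nullptr; // the semaphore that issued this handle
    uint64_t id = 0;
};

struct semaphore_sketch {
    uint64_t next_id = 1;

    inactive_read_handle register_inactive() {
        return inactive_read_handle{this, next_id++};
    }
    // Returns false when the handle belongs to a different semaphore,
    // instead of looking up (and possibly corrupting) a foreign reader.
    bool unregister_inactive(const inactive_read_handle& h) {
        return h.owner == this;
    }
};
```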
All the places in auth.cc where we constructed an api_error with inline
strings now use api_error factory functions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the places in conditions.cc, expressions.cc and serialization.cc where
we constructed an api_error, we always used the ValidationException type
string, which the code repeated dozens of times.
This patch converts all these places to use the factory function
api_error::validation().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the places in executor.cc where we constructed an api_error with inline
strings now use api_error factory functions. Most of them, but not all of
them, were api_error::validation(). We also needed to add a couple more of
these factory functions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the places in server.cc where we constructed an api_error with inline
strings now use api_error factory functions - we needed to add a few more.
Interestingly, we had a wrong type string for "Internal Server Error",
which we fix in this patch. We wrote the type string like that - with spaces -
because this is how it was listed in the DynamoDB documentation at
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
But this was in fact wrong, and it should be without spaces:
"InternalServerError". The botocore library (for example) recognizes it
this way, and this string can also be seen in other online DynamoDB examples.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the patch "Add exception overloads for Dynamo types", Alternator's single
api_error exception type was replaced by a more complex hierarchy of types.
The implementation was not only longer and more complex to understand -
I believe it also negated an important observation:
The "api_error" exception type is special. It is not an exception created
by code for other code. It is not meant to be caught in Alternator code.
Instead, it is supposed to contain an error message created for the *user*,
containing one of the few supported exception "names" described
in the DynamoDB documentation, and a user-readable text message. Throwing
such an exception in Alternator code means the thrower wants the request
to abort immediately, and this message to reach the user. These exceptions
are not designed to be caught in Alternator code. Code should use other
exceptions - or alternatives to exceptions (e.g., std::optional) for
problems that should be handled before returning a different error to the
user. Moreover, "api_error" isn't just thrown as an exception - it can
also be returned by value (in an executor::request_return_type) - which is
another reason why it should not be subclassed.
For these reasons, I believe we should have a single api_error type, and
it's wrong to subclass it. So in this patch I am reverting the subclasses
and template added in the aforementioned patch.
Still, one correct observation made in that patch was that it is
inconvenient to type in DynamoDB exception names (no help from the editor
in completing those strings) and also error-prone. In this patch we
propose a different - simpler - solution to the same problem:
We add trivial factory functions, e.g., api_error::validation(std::string)
as a shortcut to api_error("ValidationException"). The new implementation
is easy to understand, and also more self-explanatory to readers:
It is now clear that "api_error::validation()" is actually a user-visible
"api_error", something which was obscured by the name validation_exception()
used before this patch.
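The factory approach could be sketched like this (the exact real signatures may differ; this is a minimal illustration of the single-type design, including the corrected "InternalServerError" type string):

```cpp
#include <cassert>
#include <string>

// Sketch: one concrete api_error type, never subclassed, plus trivial
// static factories that fill in the DynamoDB error-type string.
struct api_error {
    std::string type;     // DynamoDB exception "name", visible to the user
    std::string message;  // user-readable text

    static api_error validation(std::string msg) {
        return api_error{"ValidationException", std::move(msg)};
    }
    static api_error internal(std::string msg) {
        // Note: no spaces - "InternalServerError", not "Internal Server Error".
        return api_error{"InternalServerError", std::move(msg)};
    }
};
```

Call sites then write `api_error::validation("...")` and the type string is spelled exactly once.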
Finally, this patch also improves the comment in error.hh explaining the
purpose of api_error and the fact it can be returned or thrown. The fact
it should not be subclassed is legislated with a "final". There is also
no point of this class inheriting from std::exception or having virtual
functions, or an empty constructor - so all these are dropped as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
There are 5 services that register their RPC handlers in the messaging
service, but not all of them unregister them on stop.
Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-frees in
the handlers.
In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.
The set brings the RPC handlers unregistration in sync with the
registration part.
tests: unit (dev)
dtest (dev: simple_boot_shutdown, repair)
start-stop by hands (dev)
fixes: #6904
"
* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
main: Add missing calls to unregister RPC handlers
messaging: Add missing per-service unregistering methods
messaging: Add missing handlers unregistration helpers
streaming: Do not use db->invoke_on_all in vain
storage_proxy: Detach rpc unregistration from stop
main: Shorten call to storage_proxy::init_messaging_service
Currently, we talk to a seed node in each gossip round with some
probability, i.e., nr_of_seeds / (nr_of_live_nodes + nr_of_unreachable_nodes)
For example, with 5 seeds in a 50-node cluster, the probability is 0.1.
Now that we talk to all live nodes, including the seed nodes, within a
bounded time period, it is no longer necessary to talk to a seed node in
each gossip round.
In order to get rid of the seed concept, do not talk to seed node
explicitly in each gossip round.
This patch is a preparatory patch to remove the seed concept in gossip.
Refs: #6845
Tests: update_cluster_layout_tests.py
Currently, we select 10 percent of random live nodes to talk with in
each gossip round. There is no upper bound on how long it will take to
talk to all live nodes.
This patch changes the way we select live nodes to talk with as below:
1) Shuffle all the live endpoints randomly
2) Split the live endpoints into 10 groups
3) Talk to one of the groups in each gossip round
4) Go to step 1 to shuffle again after all groups have been talked with
This keeps the randomness of selecting nodes as before, while adding the
determinacy of completing a pass over all live nodes.
In addition, the way to favor newly added node is simplified. When a
new live node is added, it is always added to the front of the group, so
it will be talked with in the next gossip round.
This patch is a preparatory patch to remove the seed concept in gossip.
Refs: #6845
Tests: update_cluster_layout_tests.py
Let each submodule be responsible for its own dependencies, and
call the submodule's dependency installation script.
Reviewed-by: Piotr Jastrzebski <piotr@scylladb.com>
Reviewed-by: Takuya ASADA <syuu@scylladb.com>
tools/java and tools/jmx have their own relocatable packages (and rpm/deb),
so they should not be part of the main relocatable package.
Enforce this by enabling the filter parameter in reloc_add, and passing
a filter that excludes tools/java and tools/jmx.
There is one monitor per writer, so we now keep them together in the
compaction_writer struct.
This trivially guarantees that the monitor is always destroyed before
the writer.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The gossiper's and migration_manager's unregistration is done on
the services' stop; for the rest we need to call the recently
introduced methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
5 services register handlers in messaging, but not all of them
have clear unregistration methods.
Summary:
migration_manager: everything is in place, no changes
gossiper: ditto
proxy: some verbs unregistration is missing
repair: no unregistration at all
streaming: ditto
This patch adds the needed unregistration methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Handlers for each verb should have both register and unregister helpers,
but the unregistration ones for some verbs are missing, so here they are.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy's stop method is not called (and unlikely will be soon), but stopping
the message handlers is needed now, so prepare the existing method for this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If ring_delay == 0, something fishy is going on, e.g. single-node tests
are being performed. In this case we want the CDC generation to start
operating immediately. There is no need to wait until it propagates to
the cluster.
You should not use ring_delay == 0 in production.
Fixes https://github.com/scylladb/scylla/issues/6864.
"
This adds a '--io-setup N' command line option, which users can pass to
specify whether they want to run the "scylla_io_setup" script or not.
This is useful if users want to specify I/O settings themselves in
environments such as Kubernetes, where running "iotune" is problematic.
While at it, add the same option to "scylla_setup" to keep the interface
between that script and Docker consistent.
Fixes #6587
"
* penberg-penberg/docker-no-io-setup:
scylla_setup: Add '--io-setup ENABLE' command line option
dist/docker: Add '--io-setup ENABLE' command line option
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...))
to
make_lw_shared<T>(...)
if we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
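The constraint can be seen with std::make_shared itself: the element type is always an explicit template argument and the constructor arguments are forwarded, so there is no overload that moves from an already-constructed T.

```cpp
#include <cassert>
#include <memory>
#include <string>

// std::make_shared<T>(args...) forwards args to T's constructor;
// the type cannot be deduced from a T&& argument, so the portable
// spelling passes constructor arguments directly:
auto p = std::make_shared<std::string>(5, 'x'); // constructs "xxxxx" in place
// The seastar-only form make_lw_shared(T(...)) has no std counterpart;
// std::make_shared(std::string(5, 'x')) would not compile.
```

Sticking to the `make_shared<T>(...)` spelling keeps the code portable between seastar's and the standard library's factories.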
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
seastar::make_shared has a constructor taking a T&&. There is no such
constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_shared(T(...))
to
make_shared<T>(...)
if we don't want to depend on the idiosyncrasies of
seastar::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves a few calls to make_shared to a single location.
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves a few calls to make_shared to a single location.
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar 4a99d56453...02ad74fa7d (5):
> TLS: Use "known" (precalculated) DH parameters if available
> tutorial: fix advanced service_loop examples
> tutorial: further fix service_loop example text
> linux-aio: make the RWF_NOWAIT support work again
> locking_test: Fix a use after return
To make the "scylla_setup" interface similar to Docker image, let's add
a "--io-setup ENABLE" command line option. The old "--no-io-setup"
option is retained for compatibility.
This adds a '--io-setup N' command line option, which users can pass to
specify whether they want to run the "scylla_io_setup" script or not.
This is useful if users want to specify I/O settings themselves in
environments such as Kubernetes, where running "iotune" is problematic.
Fixes #6587
A test case which reproduces the view update generator hang, where the
staging reader consumes all resources and leaves none for the pre-image
reader which blocks on the semaphore indefinitely.
We recently saw a weird log message:
WARN 2020-07-19 10:22:46,678 [shard 0] repair - repair id [id=4,
uuid=0b1092a1-061f-4691-b0ac-547b281ef09d] failed: std::runtime_error
({shard 0: fmt::v6::format_error (invalid type specifier), shard 1:
fmt::v6::format_error (invalid type specifier)})
It turned out we have:
throw std::runtime_error(format("repair id {:d} on shard {:d} failed to
repair {:d} sub ranges", id, shard, nr_failed_ranges));
in the code, but we changed the id from integer to repair_uniq_id class.
We do not really need to specify the format specifiers for numbers.
Fixes #6874
To allow tests to reliably calculate the amount of resources they need
to consume in order to effectively reduce the resources of the semaphore
to a desired amount. Using `available_resources()` is not reliable as it
doesn't factor in resources that are consumed at the moment but will be
returned later.
This will also benefit debugging coredumps, where we will now be able to
tell how many resources the semaphore was created with and thus
calculate the amount of memory and count currently used.
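A minimal sketch of the idea (hypothetical names): the semaphore keeps the resources it was created with alongside the currently available ones, so consumed resources can always be derived as initial minus available, even while some units are temporarily checked out.

```cpp
#include <cassert>

// Sketch of the resource pair tracked by the semaphore.
struct resources { int count; long memory; };

struct reader_concurrency_semaphore {
    resources _initial;
    resources _available;
    explicit reader_concurrency_semaphore(resources r)
        : _initial(r), _available(r) {}
    void consume(resources r) {
        _available.count -= r.count;
        _available.memory -= r.memory;
    }
    resources initial_resources() const { return _initial; }
    resources available_resources() const { return _available; }
    // Reliable even while resources are checked out: initial - available.
    resources consumed_resources() const {
        return {_initial.count - _available.count,
                _initial.memory - _available.memory};
    }
};
```

A test can then reduce the semaphore to an exact target by consuming `initial - target`, instead of guessing from `available_resources()` alone.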
So that tests can test the `view_update_consumer` in isolation, without
having to set up the whole database machinery. In addition to less
infrastructure setup, this allows more direct checking of mutations
pushed for view generation.
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
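The anti-thrashing rule can be sketched as follows (the 1MB threshold comes from the text above; the names are hypothetical): the staging reader only becomes evictable once it has produced enough data, so it cannot be evicted right after reading a tiny partition.

```cpp
#include <cassert>
#include <cstddef>

// Only allow eviction after at least 1MB has been read (from the text).
constexpr size_t evict_threshold = 1024 * 1024;

struct staging_reader {
    size_t bytes_read = 0;
    void on_read(size_t n) { bytes_read += n; }
    // Evicting earlier would cause thrashing: the reader would be
    // recreated over and over for very little progress.
    bool evictable() const { return bytes_read >= evict_threshold; }
};
```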
A variant of `make_range_sstable_reader()` that wraps the reader in a
restricting reader, hence making it wait for admission on the read
concurrency semaphore, before starting to actually read.
Staging SSTables can be incorrectly added to or removed from the backlog
tracker after an ALTER TABLE or TRUNCATE, because the addition and removal
don't take into account whether the SSTable requires view building. A
staging SSTable can thus be added to the tracker after an ALTER TABLE, or
removed after a TRUNCATE even though it was never added, potentially
causing the backlog to become negative.
Fixes #6798.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
test.py passes the "--junit-xml" option to test/alternator/run, which passes
this option to pytest to get an xunit-format summary of the test results.
However, unfortunately until very recent versions (which aren't yet in Linux
distributions), pytest defaulted to a non-standard xunit format which tools
like Jenkins couldn't properly parse. The more standard format can be chosen
by passing the option "-o junit_family=xunit2", so this is what we do in
this patch.
Fixes #6767 (hopefully).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200719203414.985340-1-nyh@scylladb.com>
"
The set's goal is to reduce the indirect fanout of only 3 headers,
but it likely affects more. The measured improvement rates are:
flat_mutation_reader.hh: -80%
mutation.hh : -70%
mutation_partition.hh : -20%
tests: dev-build, 'checkheaders' for changed headers (the tree-wide
fails on master)
"
* 'br-debloat-mutation-headers' of https://github.com/xemul/scylla:
headers: Remove flat_mutation_reader.hh from several other headers
migration_manager: Remove db/schema_tables.hh inclusion from header
storage_proxy: Remove frozen_mutation.hh inclusion
storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
storage_proxy: Move hint_wrapper from .hh to .cc
headers: Remove mutation.hh from trace_state.hh
schema_mutations
When upgrading from a version that lacks some schema features, during
the transition we have a mixed cluster. Schema digests are calculated
without taking the mixed cluster's supported features into account:
every node calculates the digest as if the whole cluster supported its
own features.
Scylla already has a mechanism of redaction to the lowest common
denominator, but it hasn't been used in this context.
This commit uses the redaction mechanism when calculating the digest on
the newly added table so it will match the supported features of the
whole cluster.
Tests: Manual upgrading - upgraded to a version with an additional
feature and an additional schema column, and validated that the digest
of the table's schema is identical on every node in the mixed cluster.
All of them can live with a forward declaration of the flat_mutation_reader
plus a seastar header in the commitlog code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema_tables.hh -> migration_manager.hh couple seems to work as a
"single header for everything", creating big bloat for many seemingly
unrelated .hh's.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used there, but requires mutation_query.hh, which can thus be
removed from storage_proxy.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In test_streams.py, we had the line:
assert desc['StreamDescription'].get('StreamLabel')
In Alternator, the 'StreamLabel' attribute is missing, which the author of
this test probably thought would cause this test to fail (which is expected,
the test is marked with "xfail"). However, my version of pytest actually
doesn't like that assert is given a value instead of a comparison, and we
get the warning message:
PytestAssertRewriteWarning: asserting the value None, please use "assert is None"
I think that the nicest replacement for this line is
assert 'StreamLabel' in desc['StreamDescription']
This is very readable, "pythonic", and checks the right thing - it checks
that the JSON must include the 'StreamLabel' item, as the get() assertion
was supposed to have been doing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200716124621.906473-1-nyh@scylladb.com>
If external is true, _u.ptr is not null. An empty managed_bytes uses
the internal representation.
The current code looks scary, since it seems possible that backref
would still point to the old location, which would invite corruption
when the reclaimer runs.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200716233124.521796-1-espindola@scylladb.com>
* seastar 0fe32ec59...4a99d5645 (3):
> httpd: Don't warn on ECONNABORTED
> httpd: Avoid calling future::then twice on the same future
Fixes #6709.
> futures: Add a test for a broken promise in repeat
Ninja has a special pool called console that causes programs in that
pool to output directly to the console instead of being logged. By
putting test.py in that pool, it is now possible to run just
$ ninja dev-test
And see the test.py output while it is running.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200716204048.452082-1-espindola@scylladb.com>
Besides being more robust than passing const descriptor&
to continuations, this helps simplify making allocate_segment_ex's
continuations nothrow_move_constructible, which is needed for using
seastar::with_file_close_on_failure().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Fix#6825 by explicitly distinguishing single- from multi-column expressions in AST.
Tests: unit (dev), dtest secondary_indexes_test.py (dev)
"
* dekimir-single-multiple-ast:
cql3/restrictions: Separate AST for single column
cql3/restrictions: Single-column helper functions
Corresponding overload of `storage_proxy::mutate_locally`
was hardcoded to pass `db::commitlog::force_sync::no` to
`database::apply`. Unhardcode it and pass `force_sync::no`
at all existing call sites (as it was before).
`force_sync::yes` will be used later for paxos learn writes
when trying to apply mutations upgraded from an obsolete
schema version (similar to the current case when applying
locally a `frozen_mutation` stored in accepted proposal).
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>
"
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.
Generate one when starting compaction and keep it in
compaction_info. Then use it by convention in all
compaction log messages, along with compaction type,
and keyspace.table information. Finally, use the
same uuid to update compaction_history.
Fixes #6840
"
* tag 'compaction-uuid-v1' of github.com:bhalevy/scylla:
compaction: print uuid in log messages
compaction: report_(start|finish): just return description
compaction: move compaction uuid generation to compaction_info
The targets {dev|debug|release}-test run all unit tests, including
alternator/run. But this test requires the Scylla executable, which
wasn't among the dependencies. Fix it by adding build/$mode/scylla to
the dependency list.
Fixes #6855.
Tests: `ninja dev-test` after removing build/dev/scylla
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
It is legal for a user to create a table with name that has a
_scylla_cdc_log suffix. In such case, the table won't be treated as a
cdc log table, and does not require a corresponding base table to exist.
During refactoring done as part of the initial implementation of
Alternator streams (#6694), `is_log_for_some_table` started throwing
when trying to check a name like `X_scylla_cdc_log` when there was no
table with name `X`. Previously, it just returned false.
The exception originates inside `get_base_table`, which tries to return
the base table schema, not checking for its existence - which may throw.
It makes more sense for this function to return nullptr in such a case (it
already does when provided log table name does not have the cdc log
suffix), so this patch adds an explicit check and returns nullptr when
necessary.
A similar oversight happened before (see #5987), so this patch also adds
a comment which explains why existence of `X_scylla_cdc_log` does not
imply existence of `X`.
Fixes: #6852
Refs: #5724, #5987
By convention, print the following information
in all compaction log messages:
[{compaction.type} {keyspace}.{table} {compaction.uuid}]
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than logging the message in the virtual callee method
just return a string description and make the logger call in
the common caller.
1. There is no need to do the logger call in the callee,
it is simpler to format the log message in the caller
and just retrieve the per-compaction-type description.
2. Prepare to centrally print the compaction uuid.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Existing AST assumes the single-column expression is a special case of
multi-column expressions, so it cannot distinguish `c=(0)` from
`(c)=(0)`. This leads to incorrect behaviour and dtest failures. Fix
it by separating the two cases explicitly in the AST representation.
Modify AST-creation code to create different AST for single- and
multi-column expressions.
Modify AST-consuming code to handle column_name separately from
vector<column_name>. Drop code relying on cardinality testing to
distinguish single-column cases.
Add a new unit test for `c=(0)`.
Fixes #6825.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This commit is separated out for ease of review. It introduces some
functions that subsequent commits will use, thereby reducing diff
complexity of those subsequent commits. Because the new functions
aren't invoked anywhere, they are guarded by `#if 0` to avoid
unused-function errors.
The new functions perform the same work as their existing namesakes,
but assuming single-column expressions. The old versions continue to
try handling single-column as a special case of multi-column,
remaining unable to distinguish between `c = (1)` and `(c) = (1)`.
This will change in the next commit, which will drop attempts to
handle single-column cases from existing functions.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Currently, repair uses an integer to identify a repair job. The repair
id starts from 1 after each node restart. As a result, different repair
jobs can have the same id across restarts.
To make the id unique across restarts, we use a uuid in addition to the
integer id. We cannot drop the integer id completely since the HTTP API
and nodetool use it.
Fixes #6786
The markdown syntax in the contribution section is incorrect, which is
why the links appear on the same line.
Improve the contribution section by making it more explicit what the
links are about.
Message-Id: <20200714070716.143768-1-penberg@scylladb.com>
This test usually fails, with the following error. Marking it "xfail" until
we can get to the bottom of this.
dynamodb = dynamodb.ServiceResource()
dynamodbstreams = <botocore.client.DynamoDBStreams object at 0x7fa91e72de80>
def test_get_records(dynamodb, dynamodbstreams):
# TODO: add tests for storage/transactionable variations and global/local index
with create_stream_test_table(dynamodb, StreamViewType='NEW_AND_OLD_IMAGES') as table:
arn = wait_for_active_stream(dynamodbstreams, table)
p = 'piglet'
c = 'ninja'
val = 'lucifers'
val2 = 'flowers'
> table.put_item(Item={'p': p, 'c': c, 'a1': val, 'a2': val2})
test_streams.py:316:
...
E botocore.exceptions.ClientError: An error occurred (Internal Server Error) when calling the PutItem operation (reached max retries: 3): Internal server error: std::runtime_error (cdc::metadata::get_stream: could not find any CDC stream (current time: 2020/07/15 17:26:36). Are we in the middle of a cluster upgrade?)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6694
by Calle Wilund:
Implementation of DynamoDB streams using Scylla CDC.
Fixes #5065
Initial, naive implementation insofar as it uses a 1:1 mapping of CDC
stream to DynamoDB shard, i.e. there are a lot of shards.
Includes tests verified against both local DynamoDB server and actual AWS
remote one.
Note:
Because of how data puts are implemented in Alternator, we currently do
not get "proper" INSERT labels for the first write of data, because to
CDC it looks like an update. The test compensates for this, but actual
users might not like it.
In the script test/alternator/run, which runs Scylla for the Alternator
tests, add the "--experimental-features=cdc" option, to allow us testing
the streams API whose implementation requires the experimental CDC feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* tools/java 50dbf77123...3eca0e3511 (1):
> config: Avoid checking options and filtering scylla.yaml
* tools/jmx 5820992...aa94fe5 (3):
> dist: redhat: reduce log spam from unpacking sources when building rpm
> Merge 'gitignore: fix typo and add scylla-apiclient/target/' from Benny
> apiclient: Bump Jackson version to 2.10.4
In the .deb package with the new relocatable package format, all files
moved under the scylla/ directory.
So we need to call ./scylla/install.sh in debian/rules, but that does
not work correctly, since install.sh does not support being called from
another directory. To support this, we need to change to the scylla top
directory before copying files.
Merge pull request https://github.com/scylladb/scylla/pull/6834 by
Juliusz Stasiewicz:
NULLs used to give false positives in GT, LT, GEQ and LEQ ops performed
with ALLOW FILTERING. That was a consequence of not distinguishing NULL
from an empty buffer.
This patch excludes NULLs at a high level, preventing them from entering
the LHS of any comparison, i.e. it assumes that any binary operation should return
false whenever the LHS operand is NULL (note: at the moment filters with
RHS NULL, such as ...WHERE x=NULL ALLOW FILTERING, return empty sets anyway).
Fixes #6295
* '6295-do-not-compare-nulls-v2' of github.com:jul-stas/scylla:
filtering_test: check that NULLs do not compare to normal values
cql3/restrictions: exclude NULLs from comparison in filtering
The data model is now
bplus::tree<Key = int64_t, T = array<entry>>
where entry can be cache_entry or memtable_entry.
The whole thing is encapsulated into a collection called "double_decker"
from patch #3. The array<T> is an array of T-s with 0-bytes overhead used
to resolve hash conflicts (patch #2).
branch:
tests: unit(debug)
tests before v7:
unit(debug) for new collections, memtable and row_cache
unit(dev) for the rest
perf(dev)
* https://github.com/xemul/scylla/commits/row-cache-over-bptree-9:
test: Print more sizes in memory_footprint_test
memtable: Switch onto B+ rails
row_cache: Switch partition tree onto B+ rails
memtable: Count partitions separately
token: Introduce raw() helper and raw comparator
row-cache: Use ring_position_comparator in some places
dht: Detach ring_position_comparator_for_sstables
double-decker: A combination of B+tree with array
intrusive-array: Array with trusted bounds
utils: B+ tree implementation
test: Move perf measurement helpers into header
impl_count_function doesn't explicitly initialize _count, so its correctness
depends on default initialization. Let's explicitly initialize _count to
make the code future proof.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200714162604.64402-1-raphaelsc@scylladb.com>
Option to control the alternator streams CDC query/shard range time
confidence interval, i.e. the period we enforce as timestamp threshold
when reading. The default, 10s, should be sufficient on a normal
cluster, but across DCs, or with client timestamps, one might need a
larger window.
Subroutines needed by (in this case) streams implementation
moved from being file-static to class-static (exported).
To make putting handler routines in separate sources possible.
Because executor.cc is large and slow to compile.
Separation is nice.
Unfortunately, not all methods can be kept class-private,
since unrelated types also use them.
The reviewer suggested instead placing these in a top-level
header for export, i.e. not class-private at all.
I am skipping that for now, mainly because I can't come up
with a good file name. It can be part of a general refactor
of helper routine organization in executor.
Add typed exception overloads for ValidationException, ResourceNotFoundException, etc.,
to avoid writing the explicit error type as a string everywhere (with the
ever-present potential for spelling errors).
Also allows intellisense etc to complete the exception when coded.
To allow immediate JSON value conversion for types that we have a
TypeHelper<...> for.
Typed opt-get to get both automatic type conversion, _and_
find functionality in one call.
Tested operators are: `<` and `>`. Tests all types of NULLs except
`duration` because durations are explicitly not comparable. Values
to compare against were chosen arbitrarily.
The row cache memory footprint changed after the switch to B+,
because we no longer have a sole cache_entry allocation, but
also the bplus::data and bplus::node. Knowing their sizes
helps analyzing the footprint changes.
Also print the size of memtable_entry that's now also stored
in B+'s data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.
The changes are:
Similar to those for row_cache:
- compare() goes away, new collection uses ring_position_comparator
- insertion and removal happens with the help of double_decker, most
of the places are about slightly changed semantics of it
- flags are added to memtable_entry, this makes its size larger than
it could be, but still smaller than it was before
Memtable-specific:
- when a new entry is inserted into the tree, iterators _might_ get
  invalidated by the double-decker inner array. This is easy to check
when it happens, so the invalidation is avoided when possible
- the size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to. They can be squashed together with
those having token conflict and asking allocator for the occupied
memory slot is not possible. As the closest (lower) estimate the
size of enclosing B+ data node is used
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::partitions_type is changed from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>
Where token is used to quickly locate the partition by its token and
the internal array -- to resolve hashing conflicts.
Summary of changes in cache_entry:
- compare() goes away as the new collection needs a tri-compare one,
which is provided by ring_position_comparator
- when initialized the dummy entry is added with "after_all_keys" kind,
not "before_all_keys" as it was by default. This is to make tree
entries sorted by token
- insertion and removing of cache_entries happens inside double_decker,
most of the changes in row_cache.cc are about passing constructor args
from current_allocator.construct into double_decker.emplace_before()
- the _flags is extended to keep array head/tail bits. There's a room
for it, sizeof(cache_entry) remains unchanged
The rest fits smoothly into the double_decker API.
Also, as noted in the previous patch, insertion and removal _may_
invalidate iterators, but may leave them intact. However, currently
this doesn't seem to be a problem, as cache_tracker::insert() and
::on_partition_erase() invalidate iterators unconditionally.
Later this can be optimized, as iterators are invalidated by double-decker
only in case of hash conflict; otherwise it doesn't change arrays and the
B+ tree doesn't invalidate its iterators.
tests: unit(dev), perf(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In next patches the entries having token on-board will be
moved onto B+-tree rails. For this the int64_t value of the
token will be used as B+ key, so prepare for this.
One corner case -- the after_all_keys tokens must be resolved
to int64::max value to appear at the "end" of the tree. This
is not the same as "before_all_keys" case, which maps to the
int64::min value which is not allowed for regular tokens. But
for the sake of B+ switch this is OK, the conflicts of token
raw values are explicitly resolved in next patches.
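A minimal sketch of the key mapping described above, assuming hypothetical token/kind names (not the actual dht::token API): regular tokens use their raw int64_t value, while the two sentinel kinds map to the int64 extremes so that before_all_keys sorts first and after_all_keys sorts last.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative stand-in for the token kinds discussed above.
enum class token_kind { before_all_keys, key, after_all_keys };

struct token {
    token_kind kind;
    int64_t raw; // meaningful only when kind == key
};

// Resolve a token to the int64_t value used as the B+ tree key.
int64_t tree_key(const token& t) {
    switch (t.kind) {
    case token_kind::before_all_keys:
        // int64::min is not a legal raw value for regular tokens,
        // so the sentinel cannot collide with them.
        return std::numeric_limits<int64_t>::min();
    case token_kind::after_all_keys:
        return std::numeric_limits<int64_t>::max();
    default:
        return t.raw;
    }
}
```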
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row cache (and memtable) code uses own comparators built on top
of the ring_position_comparator for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.
Prepare for the switch by generalizing the ring_position_comparator
and by patching all the non-collections usage of less-compare to use
one.
The memtable code doesn't use it outside of collections, but patch it
anyway as a part of preparations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will generalize ring_position_comparator with templates
to replace cache_entry's and memtable_entry's comparators. The overload
of operator() for sstables has its own implementation, that differs from
the "generic" one, for smoother generalization it's better to detach it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The collection is a K:V store
bplus::tree<Key = K, Value = array_trusted_bounds<V>>
It will be used as partitions cache. The outer tree is used to
quickly map token to cache_entry, the inner array -- to resolve
(expected to be rare) hash collisions.
It also must be equipped with two comparators -- a less one for
keys and a full one for values. The latter is not kept on-board,
but is required on all calls.
The core API consists of just 2 calls
- Heterogeneous lower_bound(search_key) -> iterator : finds the
element that's greater or equal to the provided search key
Other than the iterator the call returns a "hint" object
that helps the next call.
- emplace_before(iterator, key, hint, ...) : the call constructs
the element right before the given iterator. The key and hint
are needed for more optimal algo, but strictly speaking not
required.
Adding an entry to the double_decker may result in growing the
node's array. Here the B+ iterator's .reconstruct() method
comes into play. The new array is created, old elements are
moved onto it, then the fresh node replaces the old one.
// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.
Insertion into this collection _may_ invalidate iterators, but
may leave them intact. Invalidation only happens in case of hashing
conflict, which can be clearly seen from the hint object, so
there's a good room for improvement.
The main usage by row_cache (the find_or_create_entry) looks like
cache_entry find_or_create_entry() {
    bound_hint hint;
    it = lower_bound(decorated_key, &hint);
    if (!hint.found) {
        it = emplace_before(it, decorated_key.token(), hint,
                            <constructor args>);
    }
    return *it;
}
Now the hint. It contains 3 booleans, that are
- match: set to true when the "greater or equal" condition
evaluated to "equal". This frees the caller from the need
to manually check whether the entry returned matches the
search key or the new one should be inserted.
This is the "!found" check from the above snippet.
To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:
{ 3, "a" }, { 5, "z" }
As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:
{ 3, "z" } -> { 5, "z" }
{ 4, "a" } -> { 5, "z" }
{ 5, "a" } -> { 5, "z" }
Apparently, the lower bound for those 3 elements is the same,
but the code-flows of emplacing them before it differ drastically.
{ 3, "z" } : need to get previous element from the tree and
push the element to its vector's back
{ 4, "a" } : need to create new element in the tree and populate
its empty vector with the single element
{ 5, "a" } : need to put the new element in the found tree
element right before the found vector position
To make one of the above decisions the .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information was already known inside the
lower_bound call and can be reported via the hint.
That said,
- key_match: set to true if tree.lower_bound() found the element
for the Key (which is token). For above examples this will be
true for cases 3z and 5a.
- key_tail: set to true if the tree element was found, but when
comparing values from array the bounding element turned out
to belong to the next tree element and the iterator was ++-ed.
For above examples this would be true for case 3z only.
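The dispatch on these hint bits can be sketched as follows; the names (bound_hint fields, emplace_path, choose_path) mirror the description above but are illustrative, not the actual double_decker code:

```cpp
#include <cassert>

// The three booleans reported back by lower_bound(), as described above.
struct bound_hint {
    bool match;     // "greater or equal" evaluated to "equal"
    bool key_match; // the tree key (token) was found
    bool key_tail;  // bound belongs to the next tree element (iterator was ++-ed)
};

// The three code-flows emplace_before must choose between.
enum class emplace_path {
    append_to_previous_array, // case { 3, "z" }: push to previous element's back
    new_tree_element,         // case { 4, "a" }: create a fresh tree element
    insert_into_found_array,  // case { 5, "a" }: insert before found position
};

// With the hint, no extra key/element comparisons are needed.
emplace_path choose_path(const bound_hint& h) {
    if (h.key_tail) {
        return emplace_path::append_to_previous_array;
    }
    if (h.key_match) {
        return emplace_path::insert_into_found_array;
    }
    return emplace_path::new_tree_element;
}
```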
And last, but not least -- the "erase self" feature, which, given
only the cache_entry pointer at hand, removes the entry from the
collection. To make this happen we need to make two steps:
1. get the array the entry sits in
2. get the b+ tree node the vector sits in
Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get iterator from the given T pointer, the algo
looks like
- Walk back the T array until hitting the head element
- Call array_trusted_bounds::from_element() getting the array
- Construct b+ iterator from obtained array
- Construct the double_decker iterator from b+ iterator and from
the number of "steps back" from above
- Call double_decker::iterator.erase()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A plain array of elements that grows and shrinks by
constructing the new instance from an existing one and
moving the elements from it.
Behaves similarly to vector's external array, but has
0-bytes overhead. The array bounds (0-th and N-th
elements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.
To remove an element from array there's also a nothrow
option that drops the requested element from array,
shifts the righter ones left and keeps the trailing
unused memory (so called "train") until reconstruction
or destruction.
Also comes with a lower_bound() helper that helps keeping
the elements sorted, and a from_element() one that returns
a reference back to the array in which the element sits.
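The "bounds via element flags" idea can be sketched like this, under the assumption of a toy element type with head/tail bits (illustrative names, not the actual array_trusted_bounds code):

```cpp
#include <cassert>
#include <cstddef>

// Toy element carrying the head/tail flags the array relies on
// instead of storing its own bounds.
struct element {
    bool head = false; // set on the 0-th element
    bool tail = false; // set on the last element
    int value = 0;
};

// Recover the array start from a pointer to any element by walking
// left until the head flag is seen -- the from_element() idea.
const element* array_head(const element* e) {
    while (!e->head) {
        --e;
    }
    return e;
}

// The array length is likewise discovered by scanning for the tail flag.
size_t array_size(const element* head) {
    size_t n = 1;
    while (!head[n - 1].tail) {
        ++n;
    }
    return n;
}
```

This is what gives the 0-bytes overhead: no size or bounds fields exist outside the elements themselves.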
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ
This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.
1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys
The design, briefly is:
There are 3 types of nodes: inner, leaf and data; inner and leaf
keep built-in arrays of N keys and N(+1) nodes. Leaf nodes sit in
a doubly linked list. Data nodes live separately from the leaf ones
and keep pointers on them. Tree handler keeps pointers on root and
left-most and right-most leaves. Nodes do _not_ keep pointers or
references on the tree (except 3 of them, see below).
changes in v9:
- explicitly marked keys/kids indices with type aliases
- marked the whole erase/clear stuff noexcept
- disposers now accept object pointer instead of reference
- clear tree in destructor
- added more comments
- style/readability review comments fixed
Prior changes
**
- Add noexcepts where possible
- Restrict Less-comparator constraint -- it must be noexcept
- Generalized node_id
- Packed code for begin()/cbegin()
**
- Unsigned indices everywhere
- Cosmetics changes
**
- Const iterators
- C++20 concepts
**
- The index_for() implementation is templatized the other way
to make it possible for AVX key search specialization (further
patching)
**
- Insertion tries to push kids to siblings before split
Before this change insertion into full node resulted into this
node being split into two equal parts. This behaviour for random
keys stress gives a tree with ~2/3 of nodes half-filled.
With this change, before splitting, the full node tries to push one
element to each of the siblings (if they exist and are not full).
This slows the insertion a bit (but it's still way faster than
the std::set), but gives 15% less total number of nodes.
- Iterator method to reconstruct the data at the given position
The helper creates a new data node, emplaces data into it and
replaces the iterator's one with it. Needed to keep arrays of
data in tree.
- Milli-optimize erase()
- Return back an iterator that will likely be not re-validated
- Do not try to update ancestors separation key for leftmost kid
This caused clear()-like workloads to work poorly as compared to
std::set. In particular the row_cache::invalidate() method does
exactly this and this change improves its timing.
- Perf test to measure drain speed
- Helper call to collect tree counters
**
- Fix corner case of iterator.emplace_before()
- Clean heterogeneous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
the bound hit the key or not
- Addressed style/cleanness review comments
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
1. storage_service: Make handle_state_left more robust when tokens are empty
In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.
2. token_metadata: Do not throw if empty tokens are passed to remove_bootstrap_tokens
Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to gossip propagation issue
as we saw in #6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such throw will cause the node to terminate which
is an overkill.
To avoid such an error bringing the whole cluster down in worse cases,
just log that empty tokens were passed to remove_bootstrap_tokens.
Refs #6468
Currently, when update_normal_tokens is called, a warning is logged:
Token X changing ownership from A to B
It is not correct to log this because we can call update_normal_tokens
against a temporary token_metadata object during topology calculation.
Refs: #6437
NULLs used to give false positives in GT, LT, GEQ and LEQ ops
performed upon `ALLOW FILTERING`. That was a consequence of
not distinguishing NULL from an empty buffer. This patch excludes
NULLs on high level, preventing them from entering any comparison,
i.e. it assumes that any binary operator should return `false`
whenever one of the operands is NULL (note: ATM filters such as
`...WHERE x=NULL ALLOW FILTERING` return empty sets anyway).
`restriction_test/regular_col_slice` had to be updated accordingly.
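The rule can be sketched as follows, modeling a possibly-NULL value with std::optional (illustrative names, not the actual cql3 expression code); note how NULL is rejected up front while an empty buffer still participates in the comparison:

```cpp
#include <cassert>
#include <optional>
#include <string>

// nullopt models NULL; an engaged empty string models an empty buffer.
using value = std::optional<std::string>;

// Any binary comparison with a NULL operand evaluates to false,
// so NULL never enters the byte-wise comparison at all.
bool filter_lt(const value& lhs, const value& rhs) {
    if (!lhs || !rhs) {
        return false; // NULL excluded on the high level
    }
    return *lhs < *rhs;
}
```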
Fixes #6295
The default ninja build target now builds artifacts and packages. Let's
add a 'build' target that only builds the artifacts.
Message-Id: <20200714105042.416698-1-penberg@scylladb.com>
Consider a cluster with two nodes:
- n1 (dc1)
- n2 (dc2)
A third node is bootstrapped:
- n3 (dc2)
The n3 fails to bootstrap as follows:
[shard 0] init - Startup failed: std::runtime_error
(bootstrap_with_repair: keyspace=system_distributed,
range=(9183073555191895134, 9196226903124807343], no existing node in
local dc)
The system_distributed keyspace is using SimpleStrategy with RF 3. For
the keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.
Fixes: #6744
Backports: 4.0, 4.1, 4.2
This adds a short project description to README to make the git
repository more discoverable. The text is an edited version of a Scylla
blurb provided by Peter Corless.
Message-Id: <20200714065726.143147-1-penberg@scylladb.com>
This patch changes the per table latencies histograms: read, write,
cas_prepare, cas_accept, and cas_learn.
Beside changing the definition type and the insertion method, the API
was changed to support the new metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch changes the row locking latencies to use
time_estimated_histogram.
The change consist of changing the histogram definition and changing how
values are inserted to the histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch moves the alternator latencies histograms to use the time_estimated_histogram.
The change requires changing the defined type and using the simpler
insertion method.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In case of a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.
For example,
Node1 (Repair master):
row1 -> hash1
row2 -> hash2
row3 -> hash3
row3' -> hash3
Node2 (Repair follower):
row1 -> hash1
row2 -> hash2
We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:
repair - Got error in row level repair: std::runtime_error
(row_diff.size() != set_diff.size())
In this case, node1 should send both row3 and row3' to the peer node
instead of failing the whole repair, because node2 does not have row3 or
row3'; otherwise node1 wouldn't have sent rows with hash3 to node2 in the
first place.
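A minimal sketch of why row_diff can legitimately be larger than set_diff, using a multimap as a stand-in for the repair row store (illustrative types and names, not the actual repair code):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// With a hash conflict, one hash maps to several rows; all of them
// must be collected and sent, not treated as an error.
std::vector<std::string> get_row_diff(
        const std::unordered_multimap<int, std::string>& rows_by_hash,
        const std::unordered_set<int>& set_diff) {
    std::vector<std::string> diff;
    for (int h : set_diff) {
        auto [first, last] = rows_by_hash.equal_range(h);
        for (auto it = first; it != last; ++it) {
            diff.push_back(it->second); // every row sharing this hash
        }
    }
    return diff;
}
```

With the fix, `diff.size() >= set_diff.size()` is acceptable; the old check demanded exact equality and aborted the repair.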
Refs: #6252
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.
But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectory is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.
The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.
Fixes #6750
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
a subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.
By resetting the default location from /tmp to testlog/{mode}
we allow test.py to consolidate all test artefacts in a single
place.
Fixes #6062, "test.py uses tmpfs"
* seastar 5632cf2146...0fe32ec596 (11):
> futures: Add a test for a broken promise in a parallel_for_each
> future: Simplify finally_body implementation
> futures_test: Extend nested_exception test
> Merge "make gate methods noexcept" from Benny
> tutorial: fix service_loop example
> sharded: fix doxygen \example clause for sharded_parameter
> Merge "future: Don't call need_preempt in 'then' and 'then_impl'" from Rafael
> future: Refactor a bit of duplicated code
> Merge "Add with_file helpers" from Benny
> Merge "Fix doxygen warnings" from Benny
> build: add doxygen to install-dependencies.sh
While fetching CDC generations, various exceptions can occur. They
are divided into "fatal" and "nonfatal", where "fatal" ones prevent
retrying of the fetch operation.
This patch makes `read_failure_exception` "non-fatal", because such
error may appear during restart. In general this type of error can
mean a few different things (e.g. an error code in a response from
replica, but also a broken connection) so retrying seems reasonable.
Fixes#6804
Consolidate the Docker image build instructions into the "Building Scylla"
section of the README instead of having it in a separate section in a different
place of the file.
Message-Id: <20200713132600.126360-1-penberg@scylladb.com>
Currently, if a user tries to CreateTable with a forbidden set of tags,
e.g., the Tags list is too long or contains an invalid value for
system:write_isolation, then the CreateTable request fails but the table
is still created. Without the tag of course.
This patch fixes this bug, and adds two test cases for it that fail
before this patch, and succeed with it. One of the test cases is
scylla_only because it checks the Scylla-specific system:write_isolation
tag, but the second test case works on DynamoDB as well.
What this patch does is to split the update_tags() function into two
parts - the first part just parses the Tags, validates them, and builds
a map. Only the second part actually writes the tags to the schema.
CreateTable now does the first part early, before creating the table,
so failure in parsing or validating the Tags will not leave a created
table behind.
Fixes #6809.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713120611.767736-1-nyh@scylladb.com>
When tests are run in parallel, it is hard to tell how much time each test
ran. The time difference between consecutive printouts (indicating a test's
end) says nothing about the test's duration.
This patch adds in "--verbose" mode, at the end of each test result, the
duration in seconds (in wall-clock time) of the test. For example,
$ ./test.py --mode dev --verbose alternator
================================================================================
[N/TOTAL] TEST MODE RESULT
------------------------------------------------------------------------------
[1/2] boost/alternator_base64_test dev [ PASS ] 0.02s
[2/2] alternator/run dev [ PASS ] 26.57s
These durations are useful for recognizing tests which are especially slow,
or runs where all the tests are unusually slow (which might indicate some
sort of misconfiguration of the test machine).
Fixes #6759
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200706142109.438905-1-nyh@scylladb.com>
This adds a new "dist-<mode>" target, which builds the server package in
selected build mode together with the other packages, and wires it to
the "<mode>" target, which is built as part of default "ninja"
invocation.
This allows us to perform a full build, package, and test cycle across
all build modes with:
./configure.py && ninja && ./test.py
Message-Id: <20200713101918.117692-1-penberg@scylladb.com>
Improve the "pull_github_pr.sh" to detect the number of commits in a
pull request, and use "git cherry-pick" to merge single-commit pull
requests.
Message-Id: <20200713093044.96764-1-penberg@scylladb.com>
There are a bunch of renames that are done if PRODUCT is not the
default, but the Python code for them is incorrect. Path.glob()
is not a static method, and Path does not support .endswith().
Fix by constructing a Path object, and later casting to str.
Let's report each missing CPU feature individually, and improve the
error message a bit. For example, if the "clmul" instruction is missing,
the report looks as follows:
ERROR: You will not be able to run Scylla on this machine because its CPU lacks the following features: pclmulqdq
If this is a virtual machine, please update its CPU feature configuration or upgrade to a newer hypervisor.
Fixes #6528
Add links to the users and developers mailing lists, and the Slack
channel in README.md to make them more discoverable.
Message-Id: <20200713074654.90204-1-penberg@scylladb.com>
Simplify the build and run instructions by splitting the text in three
sections (prerequisites, building, and running) and streamlining the
steps a bit.
Message-Id: <20200713065910.84582-1-penberg@scylladb.com>
scylla fiber often fails to really unwind the entire fiber, stopping
sooner than expected. This is expected as scylla fiber only recognizes
the most standard continuations but can drop the ball as soon as there
is an unusual transmission.
This commits adds a message below the found tasks explaining that the
list might not be exhaustive and prints a command which can be used to
explain why the unwinding stopped at the last task.
While at it also rephrase an out-of-date comment.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200710120813.100009-1-bdenes@scylladb.com>
As requested in #5763 feedback, require that Fn be callable with
binary_operator in the functions mentioned above.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The problem is that this option is defined in the seastar testing wrapper,
while no unit tests use it; all just start themselves with app.run() and
would complain about an unknown option.
"Would", because nowadays every single test in it declares its own options
in suite.yaml, that override test.py's defaults. Once an option-less unit
test is added (B+ tree ones) it will complain.
The proposal is to remove this option from the defaults; if any unit test
uses the seastar testing wrappers and needs this option, it can add one
to the suite.yaml.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200709084602.8386-1-xemul@scylladb.com>
The goal is to make the lambdas that are fed into partition cache's
clear_and_dispose() and erase_in_dispose() noexcept.
This is to satisfy B+, which strictly requires those to be noexcept
(currently used collections don't care).
The set covers not only the strictly required minimum, but also some
other methods that happened to be nearby.
* https://github.com/xemul/scylla/tree/br-noexcepts-over-the-row-cache:
row_cache: Mark invalidation lambda as noexcept
cache_tracker: Mark methods noexcept
cache_entry: Mark methods noexcept
region: Mark trivial noexcept methods as such
allocation_strategy: Mark returning lambda as noexcept
allocation_strategy: Mark trivial noexcept methods as such
dht: Mark noexcept methods
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.
This patch adds a test to reproduce this case (the test passes on DynamoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.
Fixes #6816.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
If any of the compared bytes_view's is empty
consider the empty prefix is same and proceed to compare
the size of the suffix.
A similar issue exists in legacy_compound_view::tri_comparator::operator().
It too must not pass nullptr to memcmp if any of the compared byte_view's
is empty.
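The fix can be sketched as a tri-compare over views that skips memcmp entirely when the common prefix length is zero, since memcmp must never receive a null pointer, even with a zero length (illustrative code using std::string_view as a stand-in for bytes_view):

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string_view>

// Tri-compare two byte views. An empty view may have data() == nullptr,
// so memcmp is only called when there is a non-empty common prefix;
// otherwise the empty prefix is trivially equal and only sizes decide.
int tri_compare(std::string_view a, std::string_view b) {
    size_t n = std::min(a.size(), b.size());
    if (n != 0) { // never pass nullptr to memcmp, even with length 0
        if (int c = std::memcmp(a.data(), b.data(), n)) {
            return c;
        }
    }
    return a.size() < b.size() ? -1 : a.size() > b.size() ? 1 : 0;
}
```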
Fixes#6797
Refs #6814
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test: unit(dev)
Branches: all
Message-Id: <20200709123453.955569-1-bhalevy@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6741
by Piotr Dulikowski:
This PR changes the algorithm used to generate preimages and postimages
in CDC log. While its behavior is the same for non-batch operations
(with one exception described later), it generates pre/postimages that
are organized more nicely, and account for multiple updates to the same
row in one CQL batch.
Fixes#6597, #6598
Tests:
- unit(dev), for each consecutive commit
- unit(debug), for the last commit
Previous method
The previous method worked on a per delta row basis. First, the base
table is queried for the current state of the rows being modified in
the processed mutation (this is called the "preimage query"). Then,
for each delta row (representing a modification of a row):
If preimage is enabled and the row was already present in the table,
a corresponding preimage row is inserted before the delta row.
The preimage row contains data taken directly from the preimage
query result. Only columns that are modified by the delta are
included in the preimage.
If postimage is enabled, then a postimage row is inserted after the
delta row. The postimage row contains data which was a result of
taking row data directly from the preimage query result and applying
the change the corresponding delta row represented. All columns
of the row are included in the postimage.
The above works well for simple cases such as a singular CQL INSERT,
UPDATE, DELETE, or simple CQL BATCH-es. An example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v) VALUES (0, 1, 111);
INSERT INTO tbl (pk, ck, v) VALUES (0, 2, 222);
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v from ks.tbl_scylla_cdc_log ;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v
------------------+---------------+---------+----+----+-----
...snip...
0 | 0 | null | 0 | 1 | 100
1 | 2 | null | 0 | 1 | 111
2 | 9 | null | 0 | 1 | 111
3 | 0 | null | 0 | 2 | 200
4 | 2 | null | 0 | 2 | 222
5 | 9 | null | 0 | 2 | 222
Preimage rows are represented by cdc operation 0, and postimage by 9.
Please note that all rows presented above share the same value of
cdc$time column, which was not shown here for brevity.
Problems with previous approach
This simple algorithm has some conceptual and implementational problems
which arise when processing more complicated CQL BATCH-es. Consider
the following example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v1) VALUES (0, 0, 1) USING TTL 1000;
INSERT INTO tbl (pk, ck, v2) VALUES (0, 0, 2) USING TTL 2000;
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v1, v2 FROM tbl_scylla_cdc_log;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v1 | v2
------------------+---------------+---------+----+----+------+------
...snip...
0 | 0 | null | 0 | 0 | null | 0
1 | 2 | 2000 | 0 | 0 | null | 2
2 | 9 | null | 0 | 0 | 0 | 2
3 | 0 | null | 0 | 0 | 0 | null
4 | 1 | 1000 | 0 | 0 | 1 | null
5 | 9 | null | 0 | 0 | 1 | 0
A single cdc group (corresponding to rows sharing the same cdc$time)
might have more than one delta that modify the same row. For example,
this happens when modifying two columns of the same row with
different TTLs - due to our choice of CDC log schema, we must
represent such change with two delta rows.
It does not make sense to present a postimage after the first delta
and preimage before the second - both deltas are applied
simultaneously by the same CQL BATCH, so the middle "image" is purely
imaginary and does not appear at any point in the table.
Moreover, in this example, the last postimage is wrong - v1 is updated,
but v2 is not. None of the postimages presented above represent the
final state of the row.
New algorithm
The new algorithm works now on per cdc group basis, not delta row.
When starting processing a CQL BATCH:
Load preimage query results into a data structure representing
current state of the affected rows.
For each cdc group:
For each row modified within the group, a preimage is produced,
regardless if the row was present in the table. The preimage
is calculated based on the current state. Only include columns
that are modified for this row within the group.
For each delta, produce a delta row and update the current state
accordingly.
Produce postimages in the same way as preimages - but include all
columns for each row in the postimage.
The new algorithm produces the postimage correctly when multiple deltas
affect one row, because the state of the row is updated on the fly.
This algorithm moves preimage and postimage rows to the beginning and
the end of the cdc group, accordingly. This solves the problem of
imaginary preimages and postimages appearing inside a cdc group.
Unfortunately, it is possible for one CQL BATCH to contain changes that
use multiple timestamps. This will result in one CQL BATCH creating
multiple cdc groups, with different cdc$time. As it is impossible, with
our choice of schema, to tell that those cdc groups were created from
one CQL BATCH, instead we pretend as if those groups were separate CQL
operations. By tracking the state of the affected rows, we make sure
that preimage in later groups will reflect changes introduced in
previous groups.
One more thing - this algorithm should have the same results for
singular CQL operations and simple CQL BATCH-es, with one exception.
Previously, a preimage was not produced if a row was not present in the
table. Now, the preimage row will appear unconditionally - it will have
nulls in place of column values.
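A much-simplified sketch of the per-group processing described above, for a single row with named columns (hypothetical types and names, not the actual cdc code): preimages are taken from the tracked state before any delta in the group is applied, every delta then updates the state, and the postimage is read from the state afterwards, so it reflects all deltas at once.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Tracked state of one row: column name -> value (nullopt models null).
using row_state = std::map<std::string, std::optional<int>>;
// One delta: the columns a single modification sets.
using delta = std::map<std::string, int>;

struct group_images {
    row_state preimage;  // only columns touched within the group
    row_state postimage; // all columns, after all deltas
};

group_images process_group(row_state& state, const std::vector<delta>& deltas) {
    group_images out;
    // Preimage first, from the state *before* the group is applied;
    // produced unconditionally, with nulls for absent columns.
    for (const auto& d : deltas) {
        for (const auto& [col, unused] : d) {
            out.preimage[col] = state[col];
        }
    }
    // Apply every delta in the group to the tracked state.
    for (const auto& d : deltas) {
        for (const auto& [col, v] : d) {
            state[col] = v;
        }
    }
    out.postimage = state; // reflects the whole group, not one delta
    return out;
}
```

Carrying `state` across groups is what lets preimages in later cdc groups of the same CQL BATCH reflect changes made by earlier groups.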
* 'cdc-pre-postimage-persistence' of github.com:piodul/scylla:
cdc: fix indentation
cdc: don't update partition state when not needed
cdc: implement pre/postimage persistence
cdc: add interface for producing pre/postimages
cdc: load preimage query result into partition state fields
cdc: introduce fields for keeping partition state
cdc: rename set_pk_columns -> allocate_new_log_row
cdc: track batch_no inside transformer
cdc: move cdc$time generation to transformer
cdc: move find_timestamp to split.cc
cdc: introduce change_processor interface
cdc: remove redundant schema arguments from cdc functions
cdc: move management of generated mutations inside transformer
cdc: move preimage result set into a field of transformer
cdc: keep ts and tuuid inside transformer
cdc: track touched parts of mutations inside transformer
cdc: always include preimage for affected rows
All but a few are trivially such.
The clear_continuity() calls cache_entry::set_continuous(), which became
noexcept a patch ago.
The allocator() calls region.allocator(), which was marked noexcept a few
patches back.
The on_partition_erase() calls allocator().invalidate_references(); both
were marked noexcept a few patches back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All but one are trivially such; the position() one calls is_dummy_entry()
which has just become noexcept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
cquery_nofail returns the query result, not a future. Invoking .get()
on its result is unnecessary. This just happened to compile because
shared_ptr has a get() method with the same signature as future::get.
Tests: cql_query_test unit test (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When "trace"-level logging is enabled for Alternator, we log every request,
but currently only the request's body. For debugging, it is sometimes useful
to also see the headers - which are important to debug authentication,
for example. So let's print the headers as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709103414.599883-1-nyh@scylladb.com>
In one of the longevity tests, we observed 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.
I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that seastar
oversized allocation checker did not warn in my tests where around 300K
repair hashes were inserted into the container.
- unordered_set:
hash_sets=295634, time=333029199 ns
- flat_hash_set:
hash_sets=295634, time=312484711 ns
- node_hash_set:
hash_sets=295634, time=346195835 ns
- btree_set:
hash_sets=295634, time=341379801 ns
The btree_set is a bit slower than unordered_set but it does not have
huge memory allocations. I did not measure a real difference in total time
to finish repair of the same dataset between unordered_set and btree_set.
To fix, switch to absl btree_set container.
Fixes #6190
We had some tests for the number type in Alternator and how it can be
stored, retrieved, calculated and sorted, but only had rudimentary tests
for the allowed magnitude and precision of numbers.
This patch creates a new test file, test_number.py, with tests aiming to
check exactly the supported magnitudes and precision of numbers.
These tests verify two things:
1. That Alternator's number type supports the full precision and magnitude
that DynamoDB's number type supports. We don't want to see precision
or magnitude lost when storing and retrieving numbers, or when doing
calculations on them.
2. That Alternator's number type does not have *better* precision or
magnitude than DynamoDB does. If it did, users may be tempted to rely
on that implementation detail.
The three tests of the first type pass, but all four tests of the second
type xfail: Alternator currently stores numbers using big_decimal which
has unlimited precision and almost-unlimited magnitude, and is not yet
limited by the precision and magnitude allowed by DynamoDB.
This is a known issue - Refs #6794 - and these four new xfailing tests
can be used to reproduce that issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200707204824.504877-1-nyh@scylladb.com>
While building unified-deb we first use scylla/reloc/build_deb.sh to create the scylla core package, and after that scylla/reloc/python3/build_deb.sh to create the python3 package.
In 058da69#diff-4a42abbd0ed654a1257c623716804c82 a new rm -rf command was added.
It caused the python3 build process to erase the scylla core build output.
Make the python3 build erase only its own directory, scylla-python3-package.
In some cases, tracking the state of processed rows inside `transformer`
is not needed at all. We don't need to do it if either:
- Preimage and postimage are disabled for the table,
- Only preimage is enabled and we are processing the last timestamp.
This commit disables updating the state in the cases listed above.
Moves responsibility for generating pre/postimage rows from the
"process_change" method to "produce_preimage" and "produce_postimage".
This commit actually affects the contents of generated CDC log
mutations.
Added a unit test that verifies more complicated cases with CQL BATCH.
Introduces new methods to the change_processor interface that will cause
it to produce pre/postimage rows for requested clustering key, or for
static row.
Introduces logic in split.cc responsible for calling pre/postimage
methods of the change_processor interface. This does not have any effect
on generated CDC log mutations yet, because the transformer class has
empty implementations in place of those methods.
Instead of looking up preimage data directly from the raw preimage query
results, use the raw results to populate current partition state data,
and read directly from the current partition state.
The function is no longer used in log.cc, so instead it is moved to
split.cc.
Removed the declaration of the function from the log.hh header, because it
is not used elsewhere - apart from the testing code, but that already
declares find_timestamp in the cdc_test.cc file.
This allows for a more refined use of the transformer by the
for_each_change function (now named "process_changes_with_splitting").
The change_processor interface exposes two methods so far:
begin_timestamp, and process_change (previously named "transform").
By separating those two and exposing them, process_changes_with_splitting
can cause the transformer to generate fewer CDC log mutations
- only one for each timestamp in the batch.
Adds a `begin_timestamp` method which tells the `transformer` to start
using the following timestamp and timeuuid when generating new log row
mutations.
Moves tracking of the "touched parts" statistics inside the transformer
class.
This commit is the first of multiple commits in this series which move
parts of the state used in CDC log row generation inside the
`transformer` class. There is a lot of state being passed to
`transformer` each time its methods are called, which could be as well
tracked by the `transformer` itself. This will result in a nicer
interface and will allow us to generate fewer CDC log mutations which
give the same result.
"
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of their arguments.
This patch employs specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
Fixes #6768
"
* jul-stas-6768-use-type-comparators-for-minmax:
tests: Test min/max on set
aggregate_fcts: Use per-type comparators for dynamic types
Expected behavior is the lexicographical comparison of sets
(element by element), so this test was failing when raw byte
representations were compared.
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.
This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
Fixes #6768
* seastar dbecfff5a4...cbf88f59f2 (14):
> future: mark detach_promise noexcept
> net/tls: wait for flush() in shutdown
> httpd: Use handle_exception instead of then_wrapped
> httpd: Use std::unique_ptr instead of a raw pointer
> with_lock: Handle mutable lambdas
> Merge "Make the coroutine implementation a bit more like seastar::thread" from Rafael
> tests: Fix perf fair queue stuck
> util/backtrace: don't get hash from moved simple_backtrace in tasktrace ctor
> scheduling: Allow scheduling_group_get_specific(key) to access elements with queue_is_initialized = false
> prometheus: be compatible with protobuf < 3.4.0
> Merge "fix with_lock error handling" from Benny
> Merge "Simplify the logic for detecting broken promises" from Rafael
> Merge "make scheduling_group functions noexcept" from Benny
> Merge "io_queue: Fixes and reworks for shared fair-queue" from Pavel E
"
This is the first stage of replacing the existing restrictions code with a new representation. It adds a new class `expression` to replace the existing class `restriction`. Lots of the old code is deleted, though not all -- that will come in subsequent stages.
Tests: unit (dev, debug restrictions_test), dtest (next-gating)
"
* dekimir-restrictions-rewrite:
cql3/restrictions: Drop dead code
cql3/restrictions: Use free functions instead of methods
cql3/restrictions: Create expression objects
cql3/restrictions: Add free functions over new classes
cql3/restrictions: Add new representation
Delete unused parts of the old restrictions representation:
- drop all methods, members, and types from class restriction, but
keep the class itself: it's the return type of
relation::to_restriction, which we're keeping intact for now
- drop all subclasses of single_column_restriction and
token_restriction, but keep multi_column_restriction subclasses for
their bounds_ranges method
Keep the restrictions (plural) class, because statement_restrictions
still keeps partition/clustering/other columns in separate
collections.
Move the restriction::merge_with method to primary_key_restrictions,
where it's still being used.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Instead of `restriction` class methods, use the new free functions.
Specific replacement actions are listed below.
Note that class `restrictions` (plural) remains intact -- both its
methods and its type hierarchy remain intact for now.
Ensure full test coverage of the replacement code with new file
test/boost/restrictions_test.cc and some extra testcases in
test/cql/*.
Drop some existing tests because they codify buggy behaviour
(reference #6369, #6382). Drop others because they forbid relation
combinations that are now allowed (eg, mixing equality and
inequality, comparing to NULL, etc.).
Here are some specific categories of what was replaced:
- restriction::is_foo predicates are replaced by using the free
function find_if; sometimes it is used transitively (see, eg,
has_slice)
- restriction::is_multi_column is replaced by dynamic casts (recall
that the `restrictions` class hierarchy still exists)
- utility methods is_satisfied_by, is_supported_by, to_string, and
uses_function are replaced by eponymous free functions; note that
restrictions::uses_function still exists
- restriction::apply_to is replaced by free function
replace_column_def
- when checking infinite_bound_range_deletions, the has_bound is
replaced by local free function bounded_ck
- restriction::bounds and restriction::value are replaced by the more
general free function possible_lhs_values
- using free functions allows us to simplify the
multi_column_restriction and token_restriction hierarchies; their
methods merge_with and uses_function became identical in all
subclasses, so they were moved to the base class
- single_column_primary_key_restrictions<clustering_key>::needs_filtering
was changed to reuse num_prefix_columns_that_need_not_be_filtered,
which uses free functions
Fixes #5799.
Fixes #6369.
Fixes #6371.
Fixes #6372.
Fixes #6382.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.
This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.
Fixes: #6409, #6776
"
* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
hinted handoff: disable warnings about segments left on disk
hinted handoff: release memory on commitlog termination
When a mutation is written to the commitlog, a rp_handle object is
returned which keeps a reference to commitlog segment. A segment is
"dirty" when its reference count is not zero, otherwise it is "clean".
When commitlog object is being destroyed, a warning is being printed
for every dirty segment. On the other hand, clean segments are deleted.
In case of standard mutation writing path, the rp_handle moves
responsibility for releasing the reference to the memtable to which the
mutation is written. When the memtable is flushed to disk, all
references accumulated in the memtable are released. In this context, it
makes sense to warn about dirty segments, because such segments contain
mutations that are not written to sstables, and need to be replayed.
However, hinted handoff uses a different workflow - it recreates a
commitlog object periodically. When a hint is written to commitlog, the
rp_handle reference is not released, so that segments with hints are not
deleted when destroying the commitlog. When commitlog is created again,
we get a list of saved segments with hints that we can try to send at a
later time.
Although this is intended behavior, now that releasing the hints
commitlog is done properly, it causes the mentioned warning to
periodically appear in the logs.
This patch adds a parameter for the commitlog that allows to disable
this warning. It is only used when creating hinted handoff commitlogs.
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.
This commit prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.
Fixes: #6409, #6776
Merged patch set from Piotr Sarna:
This series addresses issue #6700 again (it was reopened),
by forbidding all non-local schema changes to be performed
from within the database via CQL interface. These changes
are dangerous since they are not directly propagated to other
nodes.
Tests: unit(dev)
Fixes #6700
Piotr Sarna (4):
test: make schema changes in query_processor_test global
cql3: refuse to change schema internally for distributed tables
test: expand testing internal schema changes
cql3: add explanatory comments to execute_internal
cql3/query_processor.hh | 13 ++++++++++++-
cql3/statements/alter_table_statement.cc | 6 ------
cql3/statements/schema_altering_statement.cc | 15 +++++++++++++++
test/boost/cql_query_test.cc | 8 ++++++--
test/boost/query_processor_test.cc | 16 ++++++++--------
5 files changed, 41 insertions(+), 17 deletions(-)
The behavior of "coredumpctl info" changed in systemd v232, so we need to
support both versions.
Before systemd v232 it was simple:
it printed the 'Coredump' field only when the coredump exists on the
filesystem, and otherwise printed nothing.
Since the change in systemd v232 it became more complex:
it always prints a 'Storage' field, even when the coredump does not exist.
Beyond available/unavailable, it describes more:
- Storage: none
- Storage: journal
- Storage: /path/to/file (inaccessible)
- Storage: /path/to/file
To support both of them, we need to detect the message version first, then
try to detect the coredump path.
Fixes: #6789
reference: 47f5064207
Instead of running shutil.copy() for each *.{service,default},
create symlink for these files.
Python will copy original file when copying debian directory.
Executing internal CQL queries needs to be done with caution,
since they were designed to be used mainly for local tables
and have very specific semantics wrt. propagating
schema changes. A short comment is added in order to prevent
future misuse of this interface.
Changing the schemas via internal calls to CQL is dangerous,
since the changes are not propagated to other nodes. Thus, it should
never be used for regular distributed tables.
The guarding code was already added for ALTER TABLE statement
and it's now expanded to cover all schema altering statements.
Tests: unit(dev)
Fixes #6700
Modified the log message in view_builder::calculate_shard_build_step to make it distinct from the one in view_builder::execute, and changed their logging level to warning, since we continue even after handling an exception.
Fixes #4600
WHERE clauses with start point above the end point were handled
incorrectly. When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0. This is clearly incorrect, as the user's intent was to filter out
all possible values of a.
Fix it by explicitly short-circuiting to false when start > end. Add
a test case.
Fixes #5799.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
from Botond.
Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.
Refs: #6603
Fixes: #6707
Tests: unit(dev)
* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
db/view: view_update_generator: use partitioned sstable set
sstables: make_partitioned_sstable_set(): return an sstable_set
And pass it to `make_range_sstable_reader()` when creating the reader,
thus allowing the incremental selector created therein to exploit the
fact that staging sstables are disjoint (in the case of repair and
streaming at least). This should reduce the memory consumption of the
staging reader considerably when reading from a lot of sstables.
Instead of an `std::unique_ptr<sstable_set_impl>`. The latter doesn't
have a publicly available destructor, so it can only be called from
within `sstables/compaction_strategy.cc` where its definition resides.
Thus it is not really usable as a public function in its current form,
which shows as it has no users either.
This patch makes it usable by returning an `sstable_set`. That is what
potential callers would want anyway. In fact this patch prepares the
ground for the next one, which wishes to use this function for just
that but can't in its current form.
It is not used any more after 75cf1d18b5
(storage_service: Unify handling of replaced node removal from gossip)
in the "Make replacing node take writes" series.
Refs #5482
In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.
Refs #6468
"
This mini-series adds protection against reference loops between tasks,
preventing infinite recursion in this case.
It also contains some other improvements, like updating the task
whitelist as well as the task identification mechanism w.r.t. recent
changes in seastar.
It also improves verbose logging, which was found to not work well while
investigating the other issues fixed herein.
"
* 'scylla-gdb.py-scylla-fiber-update/v1' of https://github.com/denesb/scylla:
scylla-gdb.py: scylla_fiber: add protection against reference loops
scylla-gdb.py: scylla_fiber: relax requirement w.r.t. what object qualifies as task
scylla-gdb.py: scylla_fiber: update whitelist
scylla-gdb.py: scylla_fiber: improve verbose log output
Our JSON legacy helper functions for parsing documents to/from
string maps are indirectly tested by several unit tests, e.g.
caching_options_test.cc. They however lacked one corner case
detected only by dtest - parsing an empty map from a null JSON document.
This case is hereby added in order to prevent future regressions.
Message-Id: <df8243bd083b2ba198df665aeb944c8710834736.1594020411.git.sarna@scylladb.com>
"
Rather than relying on a gate to serialize seek's
background work with close(), change seek() to return a
future<> and wait on it.
Also, now random_access_reader read_exactly(), seek(), and close()
are made noexcept. This will be followed up by making
sstable parse methods noexcept.
Test: unit(dev)
"
* tag 'random_access_reader-v4' of github.com:bhalevy/scylla:
sstables: random_access_reader: make methods noexcept
sstables: random_access_reader: futurize seek
sstables: random_access_reader: unify input stream close code
sstables: random_access_reader: let file_random_access_reader set the input stream
sstables: random_access_reader: move functions out of line
Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to gossip propagation issue
as we saw in https://github.com/scylladb/scylla/issues/6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such throw will cause the node to terminate which
is an overkill.
To avoid such an error bringing down the whole cluster in the worst case,
just log a warning when empty tokens are passed to remove_bootstrap_tokens.
Refs #6468
Instead, label the mapped volume by passing `:z` options to `-v`
argument, like we do for other mapped volumes in the `dbuild` script.
Passing the `--privileged` flag doesn't work after the most recent
Fedora update, and anyway using `:z` is the proper way to make sure the
mounted volume is accessible. Historically it was needed to be able to
open cores as well, but since 5b08e91bd this is not necessary as the
container is created with SYS_PTRACE capability.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200703072703.10355-1-bdenes@scylladb.com>
handle all exceptions in read_exactly, seek, and close
and specify them as noexcept.
Also, specify eof() as noexcept as it trivially is.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And adjust its callers to wait on the returned future.
With this, there is no need for a gate to serialize close()
with the background work seek() used to leave behind.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define a close_if_needed() helper function, to be called
from seek() and close().
A future patch will call it with a possibly disengaged
`_in` so it will close it only if it was engaged.
close_if_needed() captures the input stream unique ptr
so it will remain valid throughout close.
This was missing from close().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow file_random_access_reader constructor to set the
input stream to prepare for futurizing seek() by adding
a protected set() method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Follow scylla-server package changes, simplified .rpm/.deb build process which merge build scripts into single script.
"
* syuu1228-python3_simplified_pkg_scripts:
python3: simplified .deb build process
python3: simplified .rpm build process
"
This series converts an API to use std::string_view and then converts
a few sstring variables to be constexpr std::string_view. This has the
advantage that a constexpr variables cannot be part of any
initialization order problem.
"
* 'espindola/convert-to-constexpr' of https://github.com/espindola/scylla:
auth: Convert sstring variables in common.hh to constexpr std::string_view
auth: Convert sstring variables in default_authorizer to constexpr std::string_view
cql_test_env: Make ks_name a constexpr std::string_view
class_registry: Use std::string_view in (un)?qualified_name
We could hit "cannot serialize '_io.BufferedReader' object" when a request gets a 404 error from the server.
Now you will get a legitimate error message in that case.
Fixes #6690
Print an error message and exit with non-zero status under the following conditions:
- coredumpctl says the coredump file is inaccessible
- failed to detect coredump file path from 'coredumpctl info <pid>'
- deleting coredump file failed because the file is missing
Fixes #6654
Sync the version number fixup from the main package; contains the #6546 and #6752 fixes.
Note that scylla-python3 likely does not affect this versioning issue,
since it uses python3 version, which normally does not contain 'rcX'.
This converts the following variables:
DEFAULT_SUPERUSER_NAME AUTH_KS USERS_CF AUTH_PACKAGE_NAME
Since they are now constexpr they will not be part of any
initialization order problems.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This converts the following variables:
ROLE_NAME RESOURCE_NAME PERMISSIONS_NAME PERMISSIONS_CF
Since they are now constexpr they will not be part of any
initialization order problems.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Merged patch series by Piotr Sarna:
The alternator project was in need of a more optimized
JSON library, which resulted in creating "rjson" helper functions.
Scylla generally used libjsoncpp for its JSON handling, but in order
to reduce the dependency hell, the usage is now migrated
to rjson, which is faster and offers the same functionality.
The original plan was to be able to drop the dependency
on libjsoncpp-lib altogether and remove it from install-dependencies.sh,
but one last usage of it remains in our test suite,
namely cql_repl. The tool compares its output JSON textually,
so it depends on how a library presents JSON - what are the delimiters,
indentation, etc. It's possible to provide a layer of translation
to force rjson to print in an identical format, but the other issue
is that libjsoncpp keeps subobjects sorted by their name,
while rjson uses an unordered structure.
There are two possible solutions for the last remaining usage
of libjsoncpp:
1. change our test suite to compare JSON documents with a JSON parser,
so that we don't rely on internal library details
2. provide a layer of translation which forces rjson to print
its objects in a format identical to libjsoncpp.
(1.) would be preferred, since now we're also vulnerable to changes
inside libjsoncpp itself - if they change anything in their output
format, tests would start failing. The issue is not critical however,
so it's left for later.
Tests: unit(dev), manual(json_test),
dtest(partitioner_tests.TestPartitioner.murmur3_partitioner_test)
Piotr Sarna (8):
alternator,utils: move rjson.hh to utils/
alternator: remove ambiguous string overloads in rjson
rjson: add parse_to_map helper function
rjson: add from_string_map function
rjson: add non-throwing parsing
rjson: move quote_json_string to rjson
treewide: replace libjsoncpp usage with rjson
configure: drop json.cc and json.hh helpers
alternator/base64.hh | 2 +-
alternator/conditions.cc | 2 +-
alternator/executor.hh | 2 +-
alternator/expressions.hh | 2 +-
alternator/expressions_types.hh | 2 +-
alternator/rmw_operation.hh | 2 +-
alternator/serialization.cc | 2 +-
alternator/serialization.hh | 2 +-
alternator/server.cc | 2 +-
caching_options.hh | 9 +-
cdc/log.cc | 4 +-
column_computation.hh | 5 +-
configure.py | 3 +-
cql3/functions/functions.cc | 4 +-
cql3/statements/update_statement.cc | 24 ++--
cql3/type_json.cc | 212 ++++++++++++++++++----------
cql3/type_json.hh | 7 +-
db/legacy_schema_migrator.cc | 12 +-
db/schema_tables.cc | 1 -
flat_mutation_reader.cc | 1 +
index/secondary_index.cc | 80 +++++------
json.cc | 80 -----------
json.hh | 113 ---------------
schema.cc | 25 ++--
test/boost/cql_query_test.cc | 9 +-
test/manual/json_test.cc | 4 +-
test/tools/cql_repl.cc | 1 +
{alternator => utils}/rjson.cc | 75 +++++++++-
{alternator => utils}/rjson.hh | 40 +++++-
29 files changed, 344 insertions(+), 383 deletions(-)
delete mode 100644 json.cc
delete mode 100644 json.hh
rename {alternator => utils}/rjson.cc (86%)
rename {alternator => utils}/rjson.hh (81%)
In order to eventually switch to a single JSON library,
most of the libjsoncpp usage is dropped in favor of rjson.
Unfortunately, one usage still remains:
test/utils/test_repl utility heavily depends on the *exact textual*
format of its output JSON files, so replacing a library results
in all tests failing because of differences in formatting.
It is possible to force rjson to print its documents in the exact
matching format, but that's left for later, since the issue is not
critical. It would be nice though if our test suite compared
JSON documents with a real JSON parser, since there are more
differences - e.g. libjsoncpp keeps children of the object
sorted, while rapidjson uses an unordered data structure.
This change should cause no change in semantics; it strives
just to replace all usage of libjsoncpp with rjson.
Existing infrastructure relies on being able to parse a JSON string
straight into a map of strings. In order to make rjson a drop-in
replacement(tm) for libjsoncpp, a similar helper function is provided.
It's redundant to provide function overloads for both string_view
and const string&, since both of them can be implicitly created from
const char*. Thus, only string_view overloads are kept.
Example code which was ambiguous before the patch, but compiles fine
after it:
rjson::from_string("hello");
Without the patch, one had to explicitly state the type, e.g.:
rjson::from_string(std::string_view("hello"));
which is excessive.
We are currently not able to apply the version number fixup to the .orig.tar.gz file,
even when we applied the correct fixup to debian/changelog, because it just reads
SCYLLA-VERSION-FILE.
We should parse debian/{changelog,control} instead.
Fixes #6736
Thrift 0.12 includes a change [1] that avoids writing the generated output
if it has not changed. As a result, if you touch cassandra.thrift
(but not change it), the generated files will not update, and as
a result ninja will try to rebuild them every time. The compilation
of thrift files will be fast due to ccache, but still we will re-link
everything.
This touching of cassandra.thrift can happen naturally when switching
to a different git branch and then switching back. The net result
is that cassandra.thrift's contents has not changed, but its timestamp
has.
Fix by adding the "restat" option to the thrift rule. This instructs
ninja to check if the output has changed as expected or not, and to
avoid unneeded rebuilds if it has not.
[1] https://issues.apache.org/jira/browse/THRIFT-4532
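The fix amounts to a one-line addition to the thrift build rule; a sketch of such a rule follows (the command line is illustrative -- the actual rule is generated by configure.py):

```ninja
rule thrift
  command = thrift -gen cpp:cob_style -out $out_dir $in
  # Re-stat outputs after the command runs; if thrift left them
  # untouched, downstream compile and link edges are not re-run.
  restat = 1
```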
It looks like an older version of my patch series was merged. The only
difference is that the new one had more tests. This patch adds the
missing ones.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630141150.1286893-1-espindola@scylladb.com>
Remember all previously visited tasks and stop if one of them is seen
again. The walk algorithm is converted from recursive to iterative to
facilitate this.
Don't require that the object is located at the start of the allocation
block. Some tasks, like `seastar::internal::when_all_state_component`
might not.
Gdb doesn't seem to handle multiple calls to `gdb.write()` writing to
the same line well, the content of some of these calls just disappears.
So make sure each log message is a separate line and use indentation
instead to illustrate references between messages.
In commit 12d929a5ae (repair: Add table_id
to row_level_repair), a call to find_column_family() was added in
repair_range. In case of the table is dropped, it will fail the
repair_range which in turn fails the bootstrap operation.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test
Fixes: #5942
"
compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:
scylla: Reactor stalled for 16262 ms on shard 4.
| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
| (inlined by) operator() at ./sstables/compaction_manager.cc:691
| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
| (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158
To fix, we futurize the function that gets the sstables. If get_local_ranges()
is called inside a seastar thread, it will yield automatically.
Fixes #6662
"
* asias-compaction_fix_stall_in_perform_cleanup:
compaction_manager: Avoid stall in perform_cleanup
compaction_manager: Return exception future in perform_cleanup
abstract_replication_strategy: Add get_ranges_in_thread
"
Currently, the fast forwarding implementation of the shard reader is
broken in some read-ahead related corner cases, namely:
* If the reader was not created yet, but there is an ongoing read-ahead
(which is going to create it), the function bails out. This will
result in this shard reader not being fast-forwarded to the new range
at all.
* If the reader was already created and there is an ongoing read-ahead,
the function will wait for this to complete, then fast-forward the
reader, as it should. However, the buffer is cleared *before* the
read-ahead is waited for. So if the read-ahead brings in new data,
this will land in the buffer. This data will be outside of the
fast-forwarded-to range and worse, as we just cleared the buffer, it
might violate mutation fragment stream monotonicity requirements.
This series fixes these two bugs and adds a unit test which reproduces
both of them.
There are no known field issues related to these bugs. Only row-level
repair ever fast-forwards the multishard reader, but it only uses it in
heterogeneous clusters. Even so, in theory neither of these bugs affects
repair as it doesn't ever fast-forward the multishard reader before all
shards arrive at EOS.
The bugs were found while auditing the code, looking for the possible
cause of #6613.
Fixes: #6715
Tests: unit(dev)
"
* 'multishard-combining-reader-fast-forward-to-fixes/v1.1' of https://github.com/denesb/scylla:
test/boost/mutation_reader_test: multishard reader: add tests for fast-forwarding with read-ahead
test/boost/mutation_reader_test: extract multishard read-ahead test setup
test/boost/mutation_reader_test: puppet_reader: fast-forward-support
mutation_reader_test: puppet_reader: make interface more predictable
dht::sharder: add virtual destructor
mutation_reader: shard_reader: fix fast-forwarding with read-ahead
Testing the multishard reader's various read-ahead related corner cases
requires a non-trivial setup. Currently there is just one such test,
but we plan to add more, so in this patch we extract this setup code into
a free function to allow reuse across multiple tests.
A fast-forwarded puppet reader goes immediately to EOS. A counter is
added to the remote control to allow tests to check which readers were
actually fast forwarded.
Currently the puppet reader will do an automatic (half) buffer-fill in
the constructor. This makes it very hard to reason about when and how
the action that was passed to it will be executed. Refactor it to take a
list of actions and only execute those, no hidden buffer-fill anymore.
No better proof is needed for this than the fact that the test which is
supposed to test the multishard reader being destroyed with a pending
read-ahead was silently broken (not testing what it should).
This patch fixes this test too.
Also fixed in this patch are the `pending` and `destroyed` fields of the
remote control; tests can now rely on these being correct and add
additional checkpoints to ensure the test is indeed doing what it was
intended to do.
The following stall was seen during a cleanup operation:
scylla: Reactor stalled for 16262 ms on shard 4.
| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
| (inlined by) operator() at ./sstables/compaction_manager.cc:691
| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
| (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158
To fix, we futurize the function that gets the local ranges and sstables.
In addition, this patch removes the dependency on the global storage_service object.
Fixes #6662
The current `fast_forward_to(const dht::partition_range&)`
implementation has two problems:
* If the reader was not created yet, but there is an ongoing read-ahead
(which is going to create it), the function bails out. This will
result in this shard reader not being fast-forwarded to the new range
at all.
* If the reader was already created and there is an ongoing read-ahead,
the function will wait for this to complete, then fast-forward the
reader, as it should. However, the buffer is cleared *before* the
read-ahead is waited for. So if the read-ahead brings in new data,
this will land in the buffer. This data will be outside of the
fast-forwarded-to range and worse, as we just cleared the buffer, it
might violate mutation fragment stream monotonicity requirements.
This patch fixes both of these bugs. Targeted reproducer unit tests are
coming in the next patches.
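The ordering part of the fix can be sketched with a simplified synchronous model (not the actual seastar reader; `pending_read_ahead` stands in for an in-flight background buffer fill):

```python
class ShardReader:
    """Toy model of the shard reader's fast-forward ordering fix."""

    def __init__(self):
        self.buffer = []
        self.range = None
        self.pending_read_ahead = None  # callable standing in for a future

    def fast_forward_to(self, new_range):
        # Wait for (here: run to completion) any ongoing read-ahead FIRST,
        # so its fragments land in the buffer before we drop it. Clearing
        # the buffer *before* the read-ahead completes would let stale,
        # out-of-range fragments reappear in the cleared buffer afterwards.
        if self.pending_read_ahead is not None:
            self.buffer.extend(self.pending_read_ahead())
            self.pending_read_ahead = None
        self.buffer.clear()
        self.range = new_range
```

After `fast_forward_to()`, the buffer is guaranteed empty and no late read-ahead can repopulate it with pre-fast-forward data.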
When we started porting the bash scripts to Python, we were not able to use
subprocess.run() since EPEL only provided Python 3.4, but now we have
a relocatable Python, so we can switch to it.
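With a modern interpreter available, `subprocess.run()` replaces the older `Popen`/`check_output` juggling, e.g.:

```python
import subprocess
import sys

# Run a command, capture its output as text, and raise on failure.
# (capture_output/text require Python >= 3.7.)
result = subprocess.run(
    [sys.executable, "-c", "print('hello from relocatable python')"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```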
* seastar 5db34ea8d3...dbecfff5a4 (3):
> sharded: Do not hang on never set freed promise
Fixes #6606.
> foreign_ptr: make constructors and methods conditionally noexcept
> foreign_ptr: specify methods as noexcept
Currently the section on "Debugging coredumps" only briefly mentions
relocatable binaries, then starts with an extensive subsection on how to
open cores generated by non-relocatable binaries. There is a subsection
about relocatable binaries, but it just contains some out-of-date
workaround without any context.
In this patch we completely replace this outdated and not very useful
subsection on relocatable binaries, with a much more extensive one,
documenting step-by-step a procedure that is known to work. Also, this
subsection is moved above the non-relocatable one. All our current
releases except for 2019.1 use relocatable binaries, so the subsection
about these should be the more prominent one.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200630145655.159926-1-bdenes@scylladb.com>
* seastar 11e86172ba...5db34ea8d3 (7):
> scollectd: Avoid a deprecated warning
> prometheus: Avoid protobuf deprecated warning
> Merge "Avoid abandoned futures in tests" from Rafael
> Merge "make circular_buffer methods noexcept" from Benny
> futures_test: Wait for future in test
> iotune: Report disk IOPS instead of kernel IOPS
> io_tester: Ability to add dsync option to open_file_dma
needs_cleanup() returns true if an sstable needs cleanup.
Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
O(num_sstables * local_ranges)
We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.
So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).
With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
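The binary-search idea can be sketched like this (a simplified model over integer tokens; the real code works on `dht::token_range`s and, unlike this sketch, also handles an sstable spanning several adjacent local ranges):

```python
import bisect

def fully_owned(local_ranges, first_token, last_token):
    """local_ranges: sorted, non-overlapping (start, end) pairs, as
    abstract_replication_strategy::get_ranges() documents.

    Binary-search the first range whose end is >= first_token, then check
    containment: O(log(local_ranges)) per sstable instead of a linear
    scan. (In real code the `ends` index would be built once per cleanup
    run, not per call.)
    """
    ends = [end for _, end in local_ranges]
    i = bisect.bisect_left(ends, first_token)
    return (i < len(local_ranges)
            and local_ranges[i][0] <= first_token
            and last_token <= local_ranges[i][1])

def needs_cleanup(local_ranges, sstable_span):
    first, last = sstable_span
    return not fully_owned(local_ranges, first, last)
```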
Fixes #6730.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
"
A recent branch introduced an issue with snapshot listing via the API
that the regular snapshot tests did not catch; this set fixes it (patch #1),
cleans up the indentation after the fix (#2), and polishes the way
nearby state is captured (#3)
Tests: dtest(nodetool_additional_tests)
"
* 'br-fix-snap-get-details' of https://github.com/xemul/scylla:
api: Remove excessive capture
api: Fix indentation after previous patch
api: Fix wrongly captured map of snapshots
Currently fmt::print is used to print an error message
if (broadcast_address != listen && seeds.count(listen))
and the logger should be used instead.
While at it, the information printed in this message is valuable
also in the error-free case, so this change logs it at `info`
level and then logs an error without repeating that info.
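A sketch of this logging pattern (Python's `logging` stands in for Scylla's logger; the function and message wording are hypothetical):

```python
import logging

logger = logging.getLogger("init")

def check_seed_config(broadcast_address, listen_address, seeds):
    # Log the addresses unconditionally at info level: useful context
    # even when the configuration is fine...
    logger.info("broadcast=%s listen=%s seeds=%s",
                broadcast_address, listen_address, seeds)
    # ...so the error itself no longer needs to repeat them.
    if broadcast_address != listen_address and listen_address in seeds:
        logger.error("the listen address must not appear in the seed list "
                     "when it differs from the broadcast address")
        return False
    return True
```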
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test: bootstrap_test.py:TestBootstrap.start_stop_test_node(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200630083826.153326-1-bhalevy@scylladb.com>
"
These patches remove the last few uses of variadic futures in scylla.
"
* 'espindola/no-variadic-future' of https://github.com/espindola/scylla:
row level repair: Don't return a variadic future from get_sink_source
row level repair: Don't return a variadic future from read_rows_from_disk
messaging_service: Don't return variadic futures from make_sink_and_source_for_*
cql3: Don't use variadic futures in select_statement
"
Offstrategy work, on boot and refresh, guarantees that a shared SSTable
will never reach the table. We have lots of extra code in
table to make it able to live with those shared SSTables.
Now we can fortunately get rid of all that code.
tests: mode(dev).
also manually tested it by triggering resharding both on boot/refresh.
"
* 'cleanup_post_offstrategy_v2' of https://github.com/raphaelsc/scylla:
distributed_loader: kill unused invoke_shards_with_ptr()
sstables:: kill unused sstables::sstable_open_info
sstables: kill unused sstable::load_shared_components()
distributed_loader: remove declaration of inexistent do_populate_column_family()
table: simplify table::discard_sstables()
table: simplify add_sstable()
table: simplify update_stats_for_new_sstable()
table: remove unused open_sstable function
distributed_loader: remove unused code
table: no longer keep track of sstables that need resharding
table: Remove unused functions no longer used by resharding
table: remove sstable::shared() condition from backlog tracker add/remove functions
table: No longer accept a shared SSTable
This patch set adds a few new features in order to fix the issue.
The list of changes is briefly as follows:
- Add a new `LWT` flag to `cql3::prepared_metadata`,
which allows clients to clearly distinguish between lwt and
non-lwt statements without need to execute some custom parsing
logic (e.g. parsing the prepared query with regular expressions),
which is obviously quite fragile.
- Introduce the negotiation procedure for cql protocol extensions.
This is done via `cql_protocol_extension` enum and is expected
to have an appropriate mirroring implementation on the client
driver side in order to work properly.
- Implement a `LWT_ADD_METADATA_MARK` cql feature on top of the
aforementioned algorithm to make the feature negotiable and use
it conditionally (iff both server and client agree with each
other on the set of cql extensions).
The feature is meant to be further utilized by client drivers
to use primary replicas consistently when dealing with conditional
statements.
* git@github.com:ManManson/scylla feature/lwt_prepared_meta_flag_2:
lwt: introduce "LWT" flag in prepared statement metadata
transport: introduce `cql_protocol_extension` enum and cql protocol extensions negotiation
Currently we mistakenly have two different ways to detect the distribution:
directly reading /etc/os-release and using the distro package.
The distro package provides well-abstracted APIs and still has full access to
the os-release information, so we should switch to it.
Fixes #6691
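The hand-rolled detection being removed boils down to parsing `/etc/os-release` KEY=VALUE lines; the `distro` package exposes the same data through a stable API (`distro.id()`, `distro.version()`, `distro.like()`). A minimal sketch of the hand-rolled side, for comparison:

```python
def parse_os_release(text):
    """Parse os-release KEY=VALUE lines into a dict -- the ad-hoc
    approach this commit replaces with the `distro` package."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip('"')
    return info
```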
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.
Before commit 7d86a3b208, the node being
replaced was not considered as a source node because it was already
replaced by the replacing node before the replace operation finished.
This is the reason why it worked in the past.
To fix, filter out the node being replaced as a source node explicitly.
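The fix itself is a straightforward filter (hypothetical names, not the actual range_streamer code):

```python
def streaming_sources(candidate_endpoints, replaced_node):
    # The node being replaced is still present in token metadata until
    # the replace operation finishes, so it must be excluded from the
    # streaming-source candidates explicitly.
    return [ep for ep in candidate_endpoints if ep != replaced_node]
```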
Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
get_shards_for_this_sstable() can be called inside table::add_sstable()
because the shards for an sstable are precomputed, so this is completely
exception safe. We want a central point for checking that shared
SSTables are no longer added to the table's sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We no longer need to conditionally track the SSTable metadata,
as table will no longer accept shared SSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that table will no longer accept shared SSTables, it no longer
needs to keep track of them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that table no longer accepts shared SSTables, those two functions can
be simplified by removing the shared condition.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With off-strategy work on reshard on boot and refresh, table no
longer needs to work with Shared SSTables. That will unlock
a host of cleanups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The result of get_snapshot_details() is saved in do_with, then
captured by reference in the json callback; then the do_with
future returns, so by the time the callback is called the map is already
freed and empty.
Fix by capturing the result directly in the callback.
Fixes recently merged b6086526.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds a new `LWT` flag to `cql3::prepared_metadata`.
That allows clients to clearly distinguish between lwt and
non-lwt statements without need to execute some custom parsing
logic (e.g. parsing the prepared query with regular expressions),
which is obviously quite fragile.
The feature is meant to be further utilized by client drivers
to use primary replicas consistently when dealing with conditional
statements.
Whether to use lwt optimization flag or not is handled by negotiation
procedure between scylla server and client library via SUPPORTED/STARTUP
messages (`LWT_ADD_METADATA_MARK` extension).
Tests: unit(dev, debug), manual testing with modified scylla/gocql driver
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The tests in test_projection_expression.py test that ProjectionExpression
works - including attribute paths - for the GetItem, Query and Scan
operations.
There is a fourth read operation - BatchGetItem, and it supports
ProjectionExpression too. We tested BatchGetItem + ProjectionExpression in
test_batch.py, but this only tests the basic feature, with top-level
attributes, and we were missing a test for nested document paths.
This patch adds such a test. It is still xfailing on Alternator (and passing
on DynamoDB), because attribute paths are still not supported (this is
issue #5024).
Refs #5024.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200629063244.287571-1-nyh@scylladb.com>
This patch adds three more tests for the ProjectionExpression parameter
of GetItem. They are tests for nested document paths like a.b[2].c.
We don't support nested paths in Alternator yet (this is issue #5024),
so the new tests all xfail (and pass on DynamoDB).
We already had similar tests for UpdateExpression, which also needs to
support document paths, but the tests were missing for ProjectionExpression.
I am planning to start the implementation of document paths with
ProjectionExpression (which is the simplest use of document paths), so I
want the tests for this expression to be as complete as possible.
Refs #5024.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200628213208.275050-1-nyh@scylladb.com>
"
The snapshotting code is already well isolated from the rest of
the storage_service, so it's relatively easy to move it into
independent component, thus de-bloating the storage_service.
As a side effect this allows painless removal of calls to global
get_storage_service() from schema::describe code.
Test: unit(debug), dtest.snapshot_test(dev), manual start-stop
"
* 'br-snapshot-controller-4' of https://github.com/xemul/scylla:
snap: Get rid of storage_service reference in schema.cc
main: Stop http server
snapshot: Make check_snapshot_not_exist a method
snapshots: Move ops gate from storage_service
snapshot: Move lock from storage_service
snapshot: Move all code into db::snapshot_ctl class
storage_service: Move all snapshot code into snapshot-ctl.cc
snapshots: Initial skeleton
snapshots: Properly shutdown API endpoints
api: Rewrap set_server_snapshot lambda
We are currently investigating a segmentation fault, which is suspected
to be caused by a leaked pause handle. Although according to the latest
theory the handle leak is not the root cause of the issue, just a
symptom, it's better to catch any bugs that would cause a handle leak
in the act, and not later when some side-effect causes a segfault.
Refs: #6613
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200625153729.522811-1-bdenes@scylladb.com>
"
Before this series scylla would effectively loop forever when, for
example, casting a decimal with a negative scale to float.
Fixes #6720
"
* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
big_decimal: Add a test for a corner case
big_decimal: Correctly handle negative scales
big_decimal: Add a as_rational member function
big_decimal: Move constructors out of line
As HACKING.md suggests, we now require gcc version >= 10. Set the
minimum at 10.1.1, as that is the first official 10 release:
https://gcc.gnu.org/releases.html
Tests: manually run configure.py and ensure it passes/fails
appropriately.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Now that snapshot stopping is correctly handled, we may pull the database
reference all the way down to schema::describe().
One tricky place is in table::snapshot() -- the local db reference is pulled
through an smp::submit_to call, but thanks to the shard checks at the place
where it is needed, the db is still "local"
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently it's not stopped at all, so serving a REST request at shutdown time
may crash things at random places.
Fixes: #5702
But it's not the end of the story. Since the server stays up while we are
shutting things down, each subsystem should carefully handle the case when
it's half-down but a request still comes in. A better solution is to
eventually unregister rest verbs, but httpd's rules cannot do that yet.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For this, de-static run_snapshot_*_operation (because we no longer have
the static global to get the lock from) and make snapshot_ctl a
peering_sharded_service so it can call invoke_on.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This includes
- rename namespace in snapshot-ctl.[cc|hh]
- move methods from storage_service to snapshot_ctl
- move snapshot_details struct
- temporarily make storage_service._snapshot_lock and ._snapshot_ops public
- replace two get_local_storage_service() occurrences with this._db
The latter is not 100% clear as the code that does this references "this"
from another shard, but the _db in question is the distributed object, so
they are all the same on all instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a plain move, no other modifications are made; even the
"service" namespace is kept, with only a few broken-indentation fixes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A placeholder for snapshotting code that will be moved into it
from the storage_service.
Also -- pass it through the API for future use.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the seastar httpd routes' unset() is at hand, we
can shut down individual API endpoints. Do this for the
snapshot calls; this will make stopping the snapshot controller
safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lambda calls the core snapshot method deep inside the
json marshalling callback. This will bring problems with
stopping the snapshot controller in the next patches.
To prepare for this -- call .get_snapshot_details()
first, then keep the result in a do_with() context. This
change doesn't affect the issue the lambda in question is
meant to solve, as the whole result set is kept in
memory anyway while being streamed out.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add expression as a member of restriction. Create or update
expression everywhere restrictions are created or updated.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
These new classes will replace the existing restrictions hierarchy.
Instead of having member functions, they expose their data publicly,
for future free functions to operate on.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This behavior is different from cassandra, but without arithmetic
operations it doesn't seem possible to notice the difference from
CQL. Using avg produces the same results, since we use an initial
value of 0 (scale = 0).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A negative scale was being passed as a positive value to
boost::multiprecision::pow, which would never finish.
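The intended semantics of a decimal with scale s is unscaled-value × 10^(−s), so a negative scale means multiplication, not division. An `as_rational`-style conversion can be sketched with Python's `Fraction`, which (unlike an unsigned power argument, the original bug) handles negative exponents correctly:

```python
from fractions import Fraction

def decimal_as_rational(unscaled, scale):
    # value = unscaled * 10**(-scale). Fraction(10) ** scale is exact and
    # well-defined for negative scales; feeding a negative scale into a
    # routine expecting an unsigned exponent is what caused the hang.
    return Fraction(unscaled) / Fraction(10) ** scale
```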
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
The patch introduces two new features to aid with negotiating
protocol extensions for the CQL protocol:
- `cql_protocol_extensions` enum, which holds all supported
extensions for the CQL protocol (currently contains only
`LWT_ADD_METADATA_MARK` extension, which will be mentioned
below).
- An additional mechanism of negotiating cql protocol extensions
to be used in a client connection between a scylla server
and a client driver.
These extensions are propagated in SUPPORTED message sent from the
server side with "SCYLLA_" prefix and received back as a response
from the client driver in order to determine intersection between
the cql extensions that are both supported by the server and
acknowledged by a client driver.
This intersection of features is later determined to be a working
set of cql protocol extensions in use for the current `client_state`,
which is associated with a particular client connection.
This way we can easily settle on the used extensions set on
both sides of the connection.
Currently there is only one value: `LWT_ADD_METADATA_MARK`, which
regulates whether to set a designated bit in prepared statement
metadata indicating if the statement at hand is an lwt statement
or not (actual implementation for the feature will be in a later
patch).
Each extension can also propagate some custom parameters to the
corresponding key. The CQL protocol specification allows sending
a list of values with each key in the SUPPORTED message, we use
that to pass parameters to extensions as `PARAM=VALUE` strings.
In case of `LWT_ADD_METADATA_MARK` it's
`SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK` which designates the
bitmask for LWT flag in prepared statement metadata in order to be
used for lookup in a client library. The associated bits of code in
`cql3::prepared_metadata` are adjusted to accommodate the feature.
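The negotiation and `PARAM=VALUE` encoding can be sketched as follows (a simplified model; the real exchange happens in CQL SUPPORTED/STARTUP frames, and the extension and parameter names here are illustrative):

```python
def negotiate_extensions(server_supported, client_acknowledged):
    # The working set is the intersection of what the server advertised
    # in the SUPPORTED message and what the client echoed back in STARTUP.
    return sorted(set(server_supported) & set(client_acknowledged))

def parse_extension_params(values):
    # Each value attached to an extension key is a "PARAM=VALUE" string;
    # split on the first '=' only, so values may themselves contain '='.
    return dict(v.split("=", 1) for v in values)
```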
The value for the flag is chosen on purpose to be the last bit
in the flags bitset since we don't want to possibly clash with
C* implementation in case they add more possible flag values to
prepared metadata (though there is an issue regarding that:
https://issues.apache.org/jira/browse/CASSANDRA-15746).
If it's fixed in upstream Cassandra, then we could synchronize
the value for the flag with them.
Also extend the underlying type of `flag` enum in
`cql3::prepared_metadata` to be `uint32_t` instead of `uint8_t`
because in either case the flags mask is serialized as a 32-bit integer.
In theory, shard-awareness extension support should also be
reworked in terms of the minimal infrastructure provided here, but
for the sake of simplicity this is left to be done in a follow-up
some time later.
This solution eliminates the need to assume that all client
drivers follow the CQL spec carefully, because scylla-specific
features and protocol extensions are enabled only when both the
server and the client driver have negotiated the supported feature set.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
## Asking questions or requesting help
Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
## Reporting an issue
Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
Header files in Scylla must be self-contained, i.e., each can be included without having to include specific other headers first. To verify that your change did not break this property, run `ninja dev-headers`. If you added or removed header files, you must `touch configure.py` first - this will cause `configure.py` to be automatically re-run to generate a fresh list of header files.
Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB.
Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
For more information, please see the [ScyllaDB web site].
[ScyllaDB web site]: https://www.scylladb.com
## Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent
versions of the C++20 compiler and of many libraries to build. The document
[HACKING.md](HACKING.md) includes detailed information on building and
developing Scylla, but to get Scylla building quickly on (almost) any build
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md),
This is a pre-configured Docker image which includes recent versions of all
the required compilers, libraries and build tools. Using the frozen toolchain
allows you to avoid changing anything in your build machine to meet Scylla's
requirements - you just need to meet the frozen toolchain's prerequisites
(mostly, Docker or Podman being available).
## Building Scylla
Building and running Scylla with the frozen toolchain `dbuild` is as easy as:
* run Scylla with one CPU and ./tmp as work directory
This will start a Scylla node with one CPU core allocated to it and data files stored in the `tmp` directory.
The `--developer-mode` flag is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations).
Please note that you need to run Scylla with `dbuild` if you built it with the frozen toolchain.
co_returnapi_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}",_pending_requests.get_count()));
seastar::metrics::description("number of operations via Alternator API"), {op(CamelCaseName)}),
#define OPERATION_LATENCY(name, CamelCaseName) \
    seastar::metrics::make_histogram("op_latency", \
        seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this] { return to_metrics_histogram(api_operations.name); }),
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator_ttl' experimental feature is enabled on all nodes.");
"summary":"Creates a hints sync point. It can be used to wait until hints between given nodes are replayed. A sync point allows you to wait for hints accumulated at the moment of its creation - it won't wait for hints generated later. A sync point is described entirely by its ID - there is no state kept server-side, so there is no need to delete it.",
"type":"string",
"nickname":"create_hints_sync_point",
"produces":[
"application/json"
],
"parameters":[
{
"name":"target_hosts",
"description":"A list of nodes towards which hints should be replayed. Multiple hosts can be listed by separating them with commas. If not provided or empty, the point will resolve when current hints towards all nodes in the cluster are sent.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"GET",
"summary":"Get the status of a hints sync point, possibly waiting for it to be reached.",
"type":"string",
"enum":[
"DONE",
"IN_PROGRESS"
],
"nickname":"get_hints_sync_point",
"produces":[
"application/json"
],
"parameters":[
{
"name":"id",
"description":"The ID of the hint sync point which should be checked or waited on",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the query returns even if hints are still being replayed. No value or 0 will cause the query to return immediately. A negative value will cause the query to wait until the sync point is reached",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"table_filters",
"description":"Optional list of table name filters in keyspace:name format",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"keyspace_filters",
"description":"Optional list of keyspace filters",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"duration",
"description":"Duration (in milliseconds) of monitoring operation",
"required":true,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"list_size",
"description":"number of the top partitions to list",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"capacity",
"description":"capacity of stream summary: determines amount of resources used in query processing",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
},
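Assuming the parameters above (duration, filters, list_size, capacity) belong to the top-partitions monitoring endpoint — the path itself is not visible in this fragment — and a node whose REST API listens on the default `localhost:10000`, a hypothetical invocation sampling for 5 seconds and listing the top 10 partitions might look like:

```console
$ curl -s 'http://localhost:10000/storage_service/toppartitions?duration=5000&list_size=10&keyspace_filters=ks1'
```

The endpoint path and port here are assumptions for illustration; consult the generated api-doc on a running node for the exact URL.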
{
"path":"/storage_service/nodes/leaving",
"operations":[
@@ -575,6 +637,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"sf",
"description":"Skip flush. When set to \"true\", do not flush memtables before snapshotting (snapshot will not contain unflushed data)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -700,7 +770,7 @@
"operations":[
{
"method":"GET",
"summary":"Scrub (deserialize + reserialize at the latest version, resolving corruptions if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false. Scrub has the following modes: Abort (default) - abort scrub if corruption is detected; Skip (same as `skip_corrupted=true`) skip over corrupt data, omitting them from the output; Segregate - segregate data into multiple sstables if needed, such that each sstable contains data with valid order; Validate - read (no rewrite) and validate data, logging any problems found.",
"type":"long",
"nickname":"scrub",
"produces":[
@@ -723,6 +793,20 @@
"type":"boolean",
"paramType":"query"
},
{
"name":"scrub_mode",
"description":"How to handle corrupt data (overrides 'skip_corrupted')",
"required":false,
"allowMultiple":false,
"type":"string",
"enum":[
"ABORT",
"SKIP",
"SEGREGATE",
"VALIDATE"
],
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace to query about",
@@ -833,6 +917,43 @@
}
]
},
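To illustrate the `scrub_mode` parameter above: assuming the scrub operation is exposed as `/storage_service/keyspace_scrub/{keyspace}` (an assumption here; the path is not shown in this fragment) on a node with its REST API on `localhost:10000`, a validate-only scrub of keyspace `ks1` might look like:

```console
$ curl -s 'http://localhost:10000/storage_service/keyspace_scrub/ks1?scrub_mode=VALIDATE'
```

With `VALIDATE`, data is read and checked but not rewritten; `ABORT`, `SKIP`, and `SEGREGATE` rewrite the sstables with the corruption-handling policy described in the summary.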
{
"path":"/storage_service/repair_status/",
"operations":[
{
"method":"GET",
"summary":"Query the repair status and return when the repair is finished or timeout",
"type":"string",
"enum":[
"RUNNING",
"SUCCESSFUL",
"FAILED"
],
"nickname":"repair_await_completion",
"produces":[
"application/json"
],
"parameters":[
{
"name":"id",
"description":"The repair ID to check for status",
"required":true,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"timeout",
"description":"Seconds to wait before the query returns even if the repair is not finished. The value -1 or not providing this parameter means no timeout",
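Putting the two parameters together: assuming a node with its REST API on `localhost:10000`, polling this endpoint for repair id 1 with a 60-second timeout might look like the following (the response body is one of the enum values above: RUNNING, SUCCESSFUL, or FAILED):

```console
$ curl -s 'http://localhost:10000/storage_service/repair_status/?id=1&timeout=60'
```

Omitting `timeout` or passing -1 makes the call block until the repair finishes.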