The default rapidjson allocator returns nullptr from
a failed allocation or reallocation. It's not a bug by itself,
but rapidjson internals usually don't check for these return values
and happily use nullptr as a valid pointer, which leads to segmentation
faults and memory corruption.
In order to prevent these bugs, the default allocator is wrapped
with a class which simply throws once it fails to allocate or reallocate
memory, thus preventing the use of nullptr in the code.
One exception is Malloc/Realloc with size 0, which rapidjson code
expects to return nullptr.
Reduce rebuilds and build time by removing unnecessary includes. Along the way,
improve header sanity.
Ref #1.
Test: dev-headers, unit(dev).
Closes #8524
* github.com:scylladb/scylla:
treewide: remove inclusions of storage_proxy.hh from headers
storage_proxy: unnest coordinator_query_result
treewide: make headers self-sufficient
utils: intrusive_btree: add missing #pragma once
Make sure that sstable::unlink will never fail.
It will terminate in the unlikely case toc_filename
throws (e.g., on bad_alloc), otherwise it ignores any other error
and just warns about it.
Make unlink a coroutine to simplify the implementation
without introducing additional allocations.
Note that remove_by_toc_name and maybe_delete_large_data_entries
are executed asynchronously and concurrently.
Waiting for them to finish is serialized by co_await,
making sure that both are waited on so as not to leave
abandoned futures behind.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210420135020.102733-1-bhalevy@scylladb.com>
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).
Ref #1.
Nested classes cannot be forward declared, and
storage_proxy::coordinator_query_result is used in pagers, where
we'd like to forward-declare it. Unnest it and introduce an alias
for compatibility.
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.
Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!
So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.
The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).
Refs #5021
Fixes #8513
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to traverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.
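The traversal can be sketched roughly as follows. This is an illustration in Python, not the Alternator C++ implementation: the name `values_equal` is made up, values use DynamoDB's JSON encoding (`SS`/`NS`/`BS` for sets, `L` for lists, `M` for maps), and set members are compared as plain strings, which is a simplification.

```python
# Illustrative sketch only; values_equal is a hypothetical name, not Alternator code.
SET_TYPES = {"SS", "NS", "BS"}  # DynamoDB string, number and binary sets

def values_equal(a, b):
    """Compare two DynamoDB-JSON values, ignoring element order inside sets."""
    if a.keys() != b.keys():          # different type tags, e.g. "S" vs "SS"
        return False
    (tag, va), = a.items()            # each value is a single {type: payload} pair
    vb = b[tag]
    if tag in SET_TYPES:              # sets: order does not matter
        return len(va) == len(vb) and set(va) == set(vb)
    if tag == "L":                    # lists stay order-sensitive, but recurse
        return len(va) == len(vb) and all(values_equal(x, y) for x, y in zip(va, vb))
    if tag == "M":                    # recurse into nested documents
        return va.keys() == vb.keys() and all(values_equal(va[k], vb[k]) for k in va)
    return va == vb                   # scalars

doc1 = {"M": {"tags": {"SS": ["a", "b"]}, "n": {"N": "1"}}}
doc2 = {"M": {"tags": {"SS": ["b", "a"]}, "n": {"N": "1"}}}
print(values_equal(doc1, doc2))
```

The two documents differ only in the order of the inner set, which is the case the new test covers.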
Refs #5021
Fixes #8514
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
When a condition expression (ConditionExpression, FilterExpression, etc.)
checked for equality of two item attributes, i.e., "x = y", and one of
these attributes was missing, we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:
When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should be true when both x and y are unset (but was false
before this patch).
The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.
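The intended semantics of "=" and "<>" can be summarized with a small sketch. It is not Alternator code: `MISSING`, `attr_eq` and `attr_ne` are hypothetical names, and an item is modeled as a plain Python dict.

```python
# Sketch of the semantics described above; all names are hypothetical.
MISSING = object()  # marker for an attribute that is not set in the item

def attr_eq(item, x, y):
    """Evaluate "x = y": an unset attribute is never equal to anything."""
    vx, vy = item.get(x, MISSING), item.get(y, MISSING)
    if vx is MISSING or vy is MISSING:
        return False
    return vx == vy

def attr_ne(item, x, y):
    """Evaluate "x <> y" as the negation of "x = y"."""
    return not attr_eq(item, x, y)

item = {"a": 3, "b": 3}
print(attr_eq(item, "a", "b"))   # both attributes set and equal
print(attr_eq(item, "x", "y"))   # both attributes unset
print(attr_ne(item, "x", "y"))   # both attributes unset
```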
Fixes #8511
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
Currently said method uses the system semaphore as a catch-all for all
scheduling groups it doesn't know about. This is incompatible with the
recent forward-porting of the service-level infrastructure as it means
that all service level related scheduling groups will fall back to the
system scheduling group, which causes two problems:
* They will experience much more limited concurrency, as the system semaphore
is assigned far fewer count units, to match the much more limited
internal traffic.
* They compete with internal reads, severely impacting the respective
internal processes, potentially causing extreme slowdown, or even
deadlock in the case of an internal query executed on behalf of a
user query being blocked on the latter.
Even if we don't have any custom service level scheduling groups at the
moment, it is better to change this such that unknown scheduling groups
fall back to using the user semaphore. We don't expect any new internal
scheduling group to pop up any time soon (and if they do we can adjust
get_reader_concurrency_semaphore() accordingly), but we do expect user
scheduling groups to be created in the future, even dynamically.
To minimize the chance of the wrong workload being associated with the
user semaphore, all statically created scheduling groups are now
explicitly listed in `get_reader_concurrency_semaphore()`, to make their
association with the respective semaphore explicit and documented.
Added a unit test which also checks the correct association for all
these scheduling groups.
Fixes: #8508
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210420105156.94002-1-bdenes@scylladb.com>
Back when rjson was only part of alternator, there was a hardcoded
limit of nested levels - 78. The number was calculated as:
- taking the DynamoDB limit (32)
- adding 7 to it to make alternator support more cases
- doubling it because rjson internals bump the level twice
for each alternator object (because the alternator object
is represented as a 2-level JSON object).
Since rjson is no longer specific to alternator, this limit
is now configurable, and the original default value is explained
in a comment.
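For reference, the arithmetic behind the original default can be written out as below. The constant names are illustrative only and are not identifiers from the rjson code.

```python
# Reconstructs the arithmetic behind the old hardcoded limit; names are illustrative.
DYNAMODB_NESTING_LIMIT = 32   # DynamoDB's limit on nesting depth
EXTRA_LEVELS = 7              # headroom added so alternator supports more cases
JSON_LEVELS_PER_OBJECT = 2    # each alternator object is a 2-level JSON object

default_nesting_limit = (DYNAMODB_NESTING_LIMIT + EXTRA_LEVELS) * JSON_LEVELS_PER_OBJECT
print(default_nesting_limit)  # 78
```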
Message-Id: <51952951a7cd17f2f06ab36211f74086e1b60d2d.1618916299.git.sarna@scylladb.com>
The Redis server started as a copy of the CQL server, but did not
receive all the fixes of the CQL server over time. For example, commit
1a8630e ("transport: silence "broken pipe" and "connection reset by
peer" errors") was only done on the CQL server.
To remedy the situation, this pull request unifies code between the CQL
and Redis servers by introducing a "generic_server" component, and
switching CQL and Redis to use it.
Test: dtest(dev)
Closes #8388
* github.com:scylladb/scylla:
generic_server: Rename "maybe_idle" to "maybe_stop"
generic_server: API documentation for connection and server classes
transport, redis: Use generic server::listen()
transport/server: Remove "redis_server" prefix from logging
transport/server: Remove "cql_server" prefix from logging
generic_server: Remove unneeded static_pointer_cast<>
transport, redis: Use generic server::do_accepts()
transport, redis: Use generic server::process()
redis: Move Redis specific code to handle_error()
transport: Move CQL specific error handling to handle_error()
transport, redis: Move connection tracking to generic_server::server class
transport, redis: Move _stopped and _connections_list to generic_server::server class
transport, redis: Move total_connections to generic_server::server class
transport, redis: Use generic server::maybe_idle()
transport, redis: Move list_base_hook<> inheritance to generic_server::connection
transport, redis: Use generic connection::shutdown()
The set recycles 16 bytes from the reader class, makes use of
rows collection sugar, generalizes range tombstones emission and
adds an invariant-check.
tests: unit(dev)
* xemul/br-cache-reader-cleanups-1.2:
cache_flat_mutation_reader: Generalize range tombstones emission
cache_flat_mutation_reader: Tune forward progress check
cache_flat_mutation_reader: Use rows insertion sugar
cache_flat_mutation_reader: Move state field
cache_flat_mutation_reader: Remove raiish comparator
cache_flat_mutation_reader: Remove unused captured variable
cache_flat_mutation_reader: Fix trace message text
Support for random-seed was added in 4ad06c7eeb
but the program still uses std::rand() to draw random keys.
Use tests::random::get_int instead so we can get a reproducible
sequence of keys given a particular random-seed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210418104455.82086-1-bhalevy@scylladb.com>
It's notoriously hard to find the service level controller symbol
(possible by guessing the offset based on system_distributed_keyspace
address, but it's very cumbersome). To make the debugging process
easier, the symbol is exported via the `debug` namespace.
Closes #8506
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).
Closes #8457
repair: Handle everywhere_topology in bootstrap_with_repair
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes it impossible to stream only from the node losing the range,
since no node is losing the range after bootstrap.
Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.
Fixes #8503
Closes #8505
* github.com:scylladb/scylla:
repair: Make the log more accurate in bootstrap_with_repair
repair: Handle everywhere_topology in bootstrap_with_repair
We have logs
expected 1 node losing range but found *more* nodes
However, we can find zero nodes as well. Drop the word *more* in the log.
In addition, print the number of nodes found.
Refs #8503
Mention executables (scylla, tools and tests) as well as how to build
individual object files and how to verify individual headers. Also
mention the not-at-all obvious trick of how to build tests with debug
symbols.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210416131950.175413-1-bdenes@scylladb.com>
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes it impossible to stream only from the node losing the range,
since no node is losing the range after bootstrap.
Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.
Fixes #8503
By default, argparse will provide the value of the option as str.
Later, we compare it with an int, which is always False. Fix by
telling argparse to provide the value as int.
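A minimal reproduction of the pitfall, using made-up option names since the affected option is not named here:

```python
import argparse

# The option names below are hypothetical and used only for illustration.
parser = argparse.ArgumentParser()
parser.add_argument("--count-as-str")            # default: value arrives as str
parser.add_argument("--count-as-int", type=int)  # argparse converts the value to int

args = parser.parse_args(["--count-as-str", "3", "--count-as-int", "3"])

print(args.count_as_str == 3)   # str compared with int
print(args.count_as_int == 3)   # int compared with int
```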
Message-Id: <20210415182149.686355-1-tgrabiec@scylladb.com>
In Python 3, division with `/` yields a float even if both operands are
integers. This results in
get_base_class_offset() returning a float, which breaks pointer
arithmetic (which is what the returned value is used for), because
instead of decrementing/incrementing the pointer, the pointer is silently
converted to a float itself, then back to some corrupt pointer
value. One user visible effect is `intrusive_list` being broken, as it
uses the above method to calculate the member type pointer from the node
pointers.
Fix by coercing the returned value to int.
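The difference is easy to see with plain integers. This is only a sketch: `offset_bits` is a hypothetical input, and in the gdb script the operands are gdb values rather than Python ints.

```python
# Plain-integer illustration; offset_bits is a hypothetical input,
# not a name taken from the gdb script.
offset_bits = 64

true_div = offset_bits / 8        # "/" always produces a float in Python 3
coerced = int(offset_bits / 8)    # the fix described above: coerce to int
floor_div = offset_bits // 8      # alternative: integer (floor) division

print(type(true_div).__name__, type(coerced).__name__, type(floor_div).__name__)
```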
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210415080034.167762-1-bdenes@scylladb.com>
The database.hh is the central recursive-headers knot -- it has ~50
includes. This patch leaves only 34 (it remains the champion though).
Similar thing for database.cc.
Both changes help the latter compile ~4% faster :)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
The range tombstone can be added-to-buffer from two places:
when it was found in cache and when it was read from the
underlying reader. Both adders can now be generalized.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When adding a range tombstone to the buffer, the check for whether
to stop filling an already full buffer is made only if
this particular range tombstone changes the lower_bound.
This check can be tuned -- if the lower bound changed
_at_ _all_ after a range tombstone was added, we may
still abort the loop.
This change will allow generalizing range tombstone
emission in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When inserting a rows_entry via unique_ptr the ptr in question
can be pushed as is, the intrusive btree code releases the
pointer (to be exception safe) itself. This makes the code
a bit shorter and simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two alignment gaps in the middle of the
c_f_m_r -- one after the state and another one after
the set of bools. Keeping them together allows the
compiler to pack the c_f_m_r better.
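The effect of grouping small fields can be seen with a toy layout. This is only an analogy built with Python's `ctypes`, which follows the platform's native C alignment rules; the field names and types are invented and do not mirror the real reader class.

```python
import ctypes

class Scattered(ctypes.Structure):
    # Small fields separated by pointer-sized ones: each small field
    # is followed by padding up to the next pointer alignment boundary.
    _fields_ = [
        ("state", ctypes.c_uint8),
        ("ptr_a", ctypes.c_void_p),
        ("flag", ctypes.c_bool),
        ("ptr_b", ctypes.c_void_p),
    ]

class Grouped(ctypes.Structure):
    # Same fields, with the small ones kept together so they share
    # a single padded slot at the end.
    _fields_ = [
        ("ptr_a", ctypes.c_void_p),
        ("ptr_b", ctypes.c_void_p),
        ("state", ctypes.c_uint8),
        ("flag", ctypes.c_bool),
    ]

print(ctypes.sizeof(Scattered), ctypes.sizeof(Grouped))
```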
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The instance of position_in_partition::tri_compare sits
on the reader itself and just occupies memory. It can be
created on demand all the more so it's only one place that
needs it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
First commit:
In the first commit, add validation of the `enabled` and `postimage` CDC options. Both are boolean options, but previously they were not validated, meaning you could issue a query:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};
```
and it would be executed without any errors, silently interpreting `dsfdsd` as false.
The first commit narrows possible values of those boolean CDC options to `false`, `true`, `0`, `1`. After applying this change, issuing the query above would result in this error message:
```
ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
```
I actually encountered this lacking validation myself, as I mistakenly issued a query:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': 'full'};
```
incorrectly assigning `full` to `postimage`, instead of `preimage`. However, before this commit, this query ran without error, interpreted `full` as `false`, and disabled postimages altogether.
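A rough sketch of the stricter parsing, written in Python for brevity. It is not the actual C++ code: `parse_cdc_bool` and the `ConfigurationException` class below are stand-ins, and since it is not stated above whether the real check is case-sensitive, this sketch assumes exact matches.

```python
class ConfigurationException(Exception):
    """Stand-in for the server-side exception type."""

def parse_cdc_bool(name, value):
    """Accept only "true", "false", "1" and "0"; reject everything else."""
    allowed = {"true": True, "1": True, "false": False, "0": False}
    if value not in allowed:
        raise ConfigurationException(
            f'Invalid value for CDC option "{name}": {value}')
    return allowed[value]

print(parse_cdc_bool("enabled", "true"))
try:
    parse_cdc_bool("postimage", "full")
except ConfigurationException as e:
    print(e)
```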
Second commit:
The second commit improves the error message for an invalid `ttl` CDC option:
Before:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi
```
After:
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd
```
```
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
```
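The two error paths can be sketched as follows, again in Python rather than the server's C++. `parse_cdc_ttl` and `ConfigurationException` are stand-ins, and `TTL_MAX` is an assumption: the actual upper bound is not stated here.

```python
class ConfigurationException(Exception):
    """Stand-in for the server-side exception type."""

TTL_MAX = 2**31 - 1  # assumption: the real upper bound is not stated above

def parse_cdc_ttl(value):
    """Parse the "ttl" option, turning parse failures into descriptive errors."""
    try:
        ttl = int(value)
    except ValueError:
        # Replaces the bare "stoi" error with a message naming the option.
        raise ConfigurationException(
            f'Invalid value for CDC option "ttl": {value}') from None
    if ttl > TTL_MAX:
        raise ConfigurationException("Invalid CDC option: ttl too large")
    return ttl

for candidate in ("86400", "kgjhfkjd", "75747885787487"):
    try:
        print(parse_cdc_ttl(candidate))
    except ConfigurationException as e:
        print(e)
```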
Closes #8486
* github.com:scylladb/scylla:
cdc: improve exception message of invalid "ttl"
cdc: add validation of "enable" and "postimage"
On Debian variants, sh -x ./install.sh will fail since our script is
written in bash, and /bin/sh in Debian variants is dash, not bash.
So detect a non-bash shell and print an error message asking users to run
the script in bash.
Fixes #8479
Closes #8484
We start background reclaim after we bootstrap, so bootstrap doesn't
benefit from it, and sees long stalls.
Fix by moving background reclaim initialization early, before
storage_service::join_cluster().
(storage_service::join_cluster() is quite odd in that main waits
for it synchronously, compared to everything else which is just
a background service that is only initialized in main).
Fixes #8473.
Closes #8474
Given that resharding is now a synchronous mandatory step, performed before
the table is populated, snapshot() can now get rid of the code which takes
into account whether or not an sstable is shared.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210414121549.85858-1-raphaelsc@scylladb.com>
* seastar d2dcda96bb...0b2c25d133 (4):
> reactor: reactor_backend_epoll: stop using signals for high resolution timers
> reactor: move task_quota_timer_thread_fn from reactor to reactor_backend_epoll
> Merge "Report maximum IO lenghts via file API" from Pavel E
> Merge "Improve efficiency of io-tester" from Pavel E
Improve the exception message of providing invalid "ttl" value to the
table.
Previously, if you executed a CREATE TABLE query with invalid "ttl"
value, you would get a non-descriptive error message:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'};
ServerError: stoi
This commit adds more descriptive exception messages:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'};
ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'};
ConfigurationException: Invalid CDC option: ttl too large
Add validation of "enabled" and "postimage" CDC options. Both options
are boolean options, but previously they were not validated, meaning
you could issue a query:
CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'};
and it would be executed without any errors, silently interpreting
"dsfdsd" as false.
This commit narrows possible values of those boolean CDC options to
false, true, 0, 1. After applying this change, issuing the query above
would result in this error message:
ConfigurationException: Invalid value for CDC option "enabled": dsfdsd
This patch fixes cql-pytest/run-cassandra to work on systems which
default to Java 11, including Fedora 33.
Recent versions of Cassandra can run on Java 11 fine, but require a
bunch of weird JVM options to work around the JPMS (Java Platform Module
System) feature. Cassandra's start scripts require these options to
be listed in conf/jvm11-server.options, which is read by the startup
script cassandra.in.sh.
Because our "run-cassandra" builds its own "conf" directory, we need
to create a jvm11-server.options file in that directory. This is ugly,
but unfortunately necessary if cql-pytest/run-cassandra is to run
on systems defaulting to Java 11.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210406220039.195796-1-nyh@scylladb.com>
We currently only update the failure detector for a node when a higher
version of application state is received. Since gossip syn messages do
not contain application state, this means we do not update the
failure detector upon receiving gossip syn messages, even though a message
received from a peer node implies that the peer node is alive.
This patch relaxes the failure detector update rule to update the
failure detector for the sender of gossip messages directly.
Refs #8296
Closes #8476
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.
In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.
We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
Fixes #8447.
Closes #8448
Support native building & unit testing in the Nix ecosystem under
nix-shell.
Actual dist packaging for Nixpkgs/NixOS is not there (yet?), because:
* Does not exactly seem like a huge priority.
* I don't even have a firm idea of how much work it would entail (it
certainly does not need the ld.so trickery, so there's that. But at
least some work would be needed, seeing how ScyllaDB needs to
integrate with its environment and NixOS is a little unorthodox).
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210413110508.5901-4-michael.livshin@scylladb.com>