'system.status' and 'system.describe_ring' are imperfect names for
what they do, so rename them. Fortunately they aren't exposed in any
released version so there is no compatibility concern.
Closes#9530
* github.com:scylladb/scylla:
system_keyspace: rename 'system.describe_ring' to 'system.token_ring'
system_keyspace: rename 'system.status' to 'system.cluster_status'
Currently, aws_instance.ephemeral_disks() returns both ephemeral disks
and EBS disks on Nitro System.
This is because both are attached as NVMe disks, we need to add disk
type detection code on NVMe handle logic.
Fixes#9440Closes#9462
Some time ago we started gathering stats for system tables in a separate
class in order to be able to distinguish which queries come from the
user - e.g. if the unpaged queries are internal or not.
Originally, only local system tables were moved into this class,
i.e. system and system_schema. It would make sense, however, to also
include other internal keyspaces in this separate class - which includes
system_distributed, system_traces, etc.
Fixes#9380Closes#9490
if mean() is called when there are no elements in the histogram,
a runtime error will happen due to division-by-zero.
approx_exponential_histogram::mean() handles it but for some
reason we forgot to do the same for estimated_histogram.
this problem was found when adding an unit test which calls
mean() in an empty histogram.
Fixes#9531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211027142813.56969-1-raphaelsc@scylladb.com>
Throw a proper exception from do_load_schemas if parse_statements
fails to parse the schema cql.
Catch it in scylla-sstable main() function so
it won't be reported as seastar - unhandled exception.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027124032.1787347-1-bhalevy@scylladb.com>
Table names are usually nouns, so SELECT/INSERT statements
sound natural: "SELECT * FROM pets". 'system.describe_ring' defies
this convention. Rename it to 'system.token_ring' so selects are natural.
The name is not in any released version, so we can safely rename it.
'system.status' is too generic, it doesn't explain the status of what.
'system.node_status' is also ambiguous (this node? all nodes?) so I
picked 'system.cluster_status'.
The internal name, nodetool_status_table, was even worse (we're
not querying the status of nodetool!) but fortunately wasn't exposed.
The name is not in any released version, so we can safely rename it.
We have two identical "Truncated frame" errors, at:
* read_frame_size() in serialization_visitors.hh;
* cql_server::connection::read_and_decompress_frame() in
transport/server.cc;
When such an exception is thrown, it is impossible to tell where was it
thrown from and it doesn't have any further information contained in it
(beyond the basic information it being thrown implies).
This patch solves both problems: it makes the exception messages unique
per location and it adds information about why it was thrown (the
expected vs. real size of the frame).
Ref: #9482Closes#9520
In definition for /column_family/major_compaction/{name} there is an
incorrect use of "bool" instead of "boolean".
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#9516
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.
Fixes#9524
Test: unit(dev)
DTest: wide_rows_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakyness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.
This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.
There is still one vulnerability left: `row_marker& row_marker()`, which
allow overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.
Fixes: #9483
Tests: unit(dev)
Closes#9519
Make sure to close the reader created by flush_fragments
if an exception occurs before it's moved to `populate_views`.
Note that it is also ok to close the reader _after_ it has been
moved, in case populate_views itself throws after closing the
reader that was moved it. For conveience flat_mutation_reader::close
supports close-after-move.
Fixes#9479
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024164138.1100304-1-bhalevy@scylladb.com>
Functions `upper_bound` and `lower_bound` had signatures:
```
template<typename T, typename... Args>
static rows_iter_type lower_bound(const T& t, Args... args);
```
This caused a dacay from `const schema&` to `schema` as one of the args,
which in turn copied the schema in a fair number of the queries. Fix
that by setting the parameter type to `Args&&`, which doesn't discard
the reference.
Fixes#9502Closes#9507
* seastar 994b4b5a0c...083898a172 (24):
> Revert "memory: always allocate buf using "malloc" for non reactor"
> Revert dpdk update to 21.08.
> tutorial: Fix typos
> queue: add back template requirement for element type to be nothrow move-constructible
> Revert "queue: require element type to be nothrow move-constructible"
> build: add the closing "-Wl,--no-whole-archive" to the ldflags
> build: add -Wno-error=volatile to CXX_FLAGS
> build: Include dpdk as a single object in libseastar.a
> Merge: queue: cleanup exception handling
> build: drop dpdk-specific machine architecture names
> reactor: call memory::configure() before initialize dpdk
> core/loop: parallel_for_each(): make entire function critical alloc section
> Merge 'scheduling groups: Add compile parameter for setting max scheduling groups count at compile time' from Eliran Sinvani
> test: coroutines_test: assign spinner lambda to local variable
> shared_ptr: mark shared_from_this functions noexcept
> lw_shared_ptr: mark shared_from_this functions noexcept
> build: update download URL for Boost
> Merge "build: build with dpdk v21.08" from Kefu
> cpu_stall_detector: handle wraparounds in Linux perf_event ring buffer
> entry_point.cc: default-initialize sigaction struct
> reactor: s/gettid()/syscall(SYS_gettid)/
> memory: always allocate buf using "malloc" for non reactor
> Revert "memory: always allocate buf using "malloc" for non reactor"
> memory: always allocate buf using "malloc" for non reactor
Currently we update the effective_replication_map
only on non-system keyspace, leaving the system keyspace,
that uses the local replication strategy, with the empty
replication_map, as it was first initialized.
This may lead to a crash when get_ranges is called later
as seen in #9494 where get_ranges was called from the
perform_sstable_upgrade path.
This change updates the effective_replication_map on all
keyspaces rather than just on the non-system ones
and adds a unit test that reproduces #9494 without
the fix and passes with it.
Fixes#9494
Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211020143217.243949-1-bhalevy@scylladb.com>
Fixes#9491
CQL server, when encountering a "general" exception (i.e. not thrown by
cql error checks), reports a wire error with simply the what() part of
exception. However, if we have nested exceptions, we will most likely
lose info here (hello encryption).
General exception case should unwind exception and give back full,
concatenated message to avoid confusion.
Closes#9492
When an experimental feature graduates from being experimental, we want
to continue allow the old "--experimental-features=..." option to work,
in case some user's configuration uses it - just do nothing. The way
we do it is to map in db::experimental_features_t::map() the feature's
name to the UNUSED value - this way the feature's name is accepted, but
doesn't change anything.
When the CDC feature graduated from being experimental, a new bit
UNUSED_CDC was introduced to do the same thing. This separate bit was
not actually necessary - if we ever check for UNUSED_CDC bit anywhere
in the code it means the flag isn't actually unused ;-) And we don't
check it.
So simplify the code by conflating UNUSED_CDC into UNUSED.
This will also make it easy to build from db::experimental_features_t::map()
a list of current experimental features - now it will simply be those that
do not map to UNUSED.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013105107.123544-1-nyh@scylladb.com>
This series fixes a bug which allowed using a secondary index in a restriction for a DELETE statement, which resulted in generating incorrect slices and deleting the whole partition instead. Secondary indexes are not meant to be used for deletes, which this series enforces by marking the indexes as not queriable.
It also comes with a reproducing test case, originally provided by @fee-mendes (thanks!).
Fixes#9495
Tests: unit(release)
Closes#9496
* github.com:scylladb/scylla:
cql-pytest: add reproducer for deleting based on secondary index
cql3: forbid querying indexes for deletions
So the compaction perf of different compaction strategies can be
compared. Data timestamps are diversified such that they fall into four
different bucket if TWCS is used, in order to be able to stress the
timestamp based splitting code path.
Closes#9488
Accessing tm.sorted_tokens().back() causes undefined behavior
if tm.sorted_tokens is empty.
Check that first and throw/abort using on_internal_error
in this case.
This will prevent the segfault but it doesn't fix the root cause
which is getting here with empty token_metadata. That will be fixed
by the following patch.
Refs #9494
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211019075710.1626808-1-bhalevy@scylladb.com>
This commit makes the column names from an invalid delete statement human readable.
Before that, they were printed in their hex representation, which is not convenient
for debugging.
Before:
InvalidRequest:
Error from server: code=2200 [Invalid query]
message="Invalid where clause contains non PRIMARY KEY columns: 76616c"
After:
InvalidRequest:
Error from server: code=2200 [Invalid query]
message="Invalid where clause contains non PRIMARY KEY columns: val"
Message-Id: <52923335e8837295fd5ba2dfd0921196e21f7f16.1634626777.git.sarna@scylladb.com>
This commit adds a test case for a bug reported by Felipe
<felipemendes@scylladb.com>. The bug involves trying to delete
an entry from a partition based on a secondary index created
on a column which is part of the compound clustering key,
and the unfortunate result is that the whole partition gets wiped.
Cassandra's behavior is in this case correct - deletion based
on a secondary index column is not allowed.
Refs #9495
Using secondary indexes for the purpose of a DELETE statement
was never expected to be well-defined, but an edge case in #9495
showed that the index may sometimes be inadvertently used, which
causes the whole partition to be deleted.
In order to prevent such errors, it's now explicitly defined
that an index is not queriable if it's going to be used for the
purpose of a DELETE statement.
metric currently_open_for_writing, used to inform # of sstables opened for writing,
holds the same value as total_open_for_writing. that means we aren't actually
decreasing the counter, so it is bogus.
Moved to sstable_writer, because sstable is used by writer to open files,
which are then extracted from sstable object, and later the same object is
reused for read-only mode.
Fixes#9455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013134812.177398-1-raphaelsc@scylladb.com>
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.
Fixes#9302.
Closes#9304.
* 'prepare_options' of https://github.com/bhalevy/scylla:
cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy
abstract_replication_strategy: add to_qualified_class_name
TWCS can reshape at most 32 sstables spanning multiple windows, in a
single compaction round. Which sstables are compacted together, when
there are more than 32 sstables, is random.
If sstables with overlapping windows are compacted together, then
write amplification can be reduced because we may be able to push
all the data to a window W in a single compaction round, so we'll
not have to perform another compaction round later in W, to reduce
its number of files. This is also very good to reduce the amount
of transient file descriptors opened, because TWCS reshape
first reshapes all sstables spanning multiple windows, so if
all windows temporarily grow large in number of files, then
there's a risk which file descriptors can be exhausted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211013203046.233540-3-raphaelsc@scylladb.com>
After a4053dbb72, data segregation is postponed to offstrategy, so reshape
procedure is called with disjoint sstables which belong to different
windows, so let's extend the optimization for disjoint sstables which
span more than one window. In this way, write amplification is reduced
for offstrategy compaction, as all disjoint sstables will be compacted
at once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013203046.233540-2-raphaelsc@scylladb.com>
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.
Fixes#9302
Test: cql_query_test.test_rf_expand(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it from cql3 check_restricted_replication_strategy and
keyspace_metadata ctor that defined their own `replication_class_strategy`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This mini series contains two fixes that are bundled together since the
second one assumes that the first one exists (or it will not fix
anything really...), the two problems were:
1. When certain operations are called on a service level controller
which doesn't have it's data accessor set, it can lead to a crash
since some operations will still try to dereference the accessor
pointer.
2. The cql environment test initialized the accessor with a
sharded<system_distributed_data>& however this sharded class as
itself is not initialized (sharded::start wasn't called), so for the
same that were unsafe for null dereference the accessor will now crash
for trying to access uninitialized sharded instance.
Closes#9468
* github.com:scylladb/scylla:
CQL test environment: Fix bad initialization order
Service Level Controller: Fix possible dereference of a null pointer
Before this patch, if Scylla crashes during some test in test/alternator,
all tests after it will fail because they can't connect to Scylla - and we
can get a report on hundreds of failures without a clear sign of where the
real problem was.
This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing health-check request
after each test. If this health-check request fails, we conclude that
Scylla crashed and report the test in which this happened - and exit
pytest instead of failing a hundred more tests.
The failure report looks something like this:
```
! _pytest.outcomes.Exit: Scylla appears to have crashed in test test_batch.py::test_batch_get_item !
```
And the entire test run fails.
These extra health checks are not free, but they come fairly close to
being free: In my tests I measured less than 0.1 seconds slowdown of
the entire test suite (which has 618 tests) caused by the extra health
checks.
Fixes#9489
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211017123222.217559-1-nyh@scylladb.com>
Issue #9467 deprecated the blanket "--experimental" option which we
used to enable all experimental Scylla features for testing, and
suggests that individual experimental features should be enabled
instead.
So this is what we do in this patch for the Scylla-running scripts
in test/alternator and test/cql-pytest: We need to enable UDF for
the CQL tests, and to enable Alternator Streams and Alternator TTL
for the Alternator tests.
Refs #9467
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211012110312.719654-2-nyh@scylladb.com>
Earlier we added experimental (and very incomplete) support for
Alternator's TTL feature, but forgot to set a *name* for this
experimental feature. As a result, this feature can be enabled only with
the blanket "--experimental" option and not with a specific
"--experimental-features=..." option.
Since issue #9467 deprecated the blanket "--experimental" option
and users are encouraged to only enable specific experimental
features, it is important that we have a name for it.
So the name chosen in this patch is "alternator-ttl".
Eventually this feature might evolve beyond Alternator-only,
but for now, I think it's a good name and we'll probably
graduate the experimental Alternator TTL feature before
supporting CQL, so it will be a new experimental feature
anyway.
Refs #9467.
db/config.cc
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211012110312.719654-1-nyh@scylladb.com>
The warning was disabled during the migration to clang, but now it
appears unnecessary (perhaps clang added support for the attributes
it did not have then). It is valuable for detecting misspelled
attributes, so enable it again.
Closes#9480
The help string from the "--experimental-features" command-line option
lists the available experimental features, to helping a user who might
want to enable them. But this help string was manually written, and has
since drifted from reality:
* Two of the listed "experimental" features, cdc and lwt, have actually
graduated from being experimental long ago. Although technically a user
may still use the words "cdc" and "lwt" in the "experimental-features"
parameter, doing so is pointless, and worse: This text in the help
string can mislead a user into thinking that these two features are
still experimental - while they are not!
* One experimental feature - alternator-ttl - is missing from this list.
Instead of updating the help string text now - and needing to do this
again and again in the future as we change experimental features - what
this patch does is to construct the list of features automatically from
the map of supported feature names - excluding any features which map
to UNUSED.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013122635.132582-1-nyh@scylladb.com>
"
The current api design of abstract_replication_strategy
provides a can_yield parameter to calls that may stall
when traversing the token metadata in O(n^2) and even
in O(n) for a large number of token ranges.
But, to use this option the caller must run in a seastar thread.
It can't be used if the caller runs a coroutine or plain
async tasks.
Rather than keep adding threads (e.g. in storage_service::load_and_stream
or storage_service::describe_ring), the series offers an infrastructure
change: precalculating the token->endpoints map once, using an async task,
and keeping the results in a `effective_replication_map` object.
The latter can be used for efficient and stall-free calls, like
get_natural_endpoints, or get_ranges/get_primary_range, replacing their
equivalents in abstract_replication_strategy, and dropping the public
abstract_replication_strategy::calculate_natural_endpoints and its
internal cached_endpoints map.
Other than the performance benefits of:
1. The current calls require running a thread to yield.
Precalculating the map (using async task) allows us to use synchronous calls
without stalling the rector.
2. The replication maps can and should be shared
between keyspaces that use the same replication strategy.
(Will be sent as a follow-up to the series)
The bigger benefits (courtesy of Avi Kivity) are laying the groundwork for:
1. atomic replication metadata - an operation can capture a replication map once, and then use consistent information from the map without worrying that it changes under its feet. We may even be able to s/inet_address/replica_ptr/ later.
2. establish boundaries on the use of replication information - by making a replication map not visible, and observing when its reference count drops to zero, we can tell when the new replication map is fully in use. When we start writing to a new node we'll be able to locate a point in time where all writes that were not aware of the new node were completed (this is the point where we should start streaming).
Notes:
* The get_natural_endpoints method that uses the effective_replication_map
is still provided as a abstract_replication_strategy virtual method
so that local_strategy can override it and privide natural endpoints
for any search token, even in the absence of token_metadata, when\
called early-on, before token_metadata has been established.
The effective_replication_map materializes the replication strategy
over a given replication strategy options and token_metadata.
Whenever either of those change for a keyspace, we make a new
effective_replication_map and keep it in the keyspace for latter use.
Methods that depend on an ad-hoc token_metadata (e.g. during
node operations like bootstrap or replace) are still provided
by abstract_replication_strategy.
TODO:
- effective_replication_map registry
- Move pending ranges from token_metadata to replication map
- get rid of abstract_replication_strategy::get_range_addresses(token_metadata&)
- calculate replication map and use it instead.
Test: unit(dev, debug)
Dtest: next-gating, bootstrap_test.py update_cluster_layout_tests.py alternator_tests.py -a 'dtest-full,!dtest-heavy' (release)
"
* tag 'effective_replication_strategy-v6' of github.com:bhalevy/scylla: (44 commits)
effective_replication_map: add get_range_addresses
abstract_replication_strategy: get rid of shared_token_metadata member and ctor param
abstract_replication_strategy: recognized_options: pass const topology&
abstract_replication_strategy: precacluate get_replication_factor for effective_replication_map
token_metadata: get rid of now-unused sync methods
abstract_replication_strategy: get rid of do_calculate_natural_endpoints
abstract_replication_strategy: futurize get_*address_ranges
abstract_replication_strategy: futurize get_range_addresses
abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr)
abstract_replication_strategy: move get_ranges and get_primary_ranges* to effective_replication_map
compaction_manager: pass owned_ranges via cleanup/upgrade options
abstract_replication_strategy: get rid of cached_endpoints
all replication strategies: get rid of do_get_natural_endpoints
storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints
abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map
storage_service: bootstrap: add log messages
storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings
shared_token_metadata: set: check version monotonicity
token_metadata: use static ring version
token_metadata: get rid of copy constructor and assignment operator
...
Make a reader that reads from memtable in reverse order.
This draft PR includes two commits, out of which only the second is
relevant for review.
Described in #9133.
Refs #1413.
Closes#9174
* github.com:scylladb/scylla:
partition_snapshot_reader: pop_range_tombstone returns reference (instead of value) when possible.
memtable: enable native reversing
partition_snapshot_reader: reverse ck_range when needed by Reversing
memtable, partition_snapshot_reader: read from partition in reverse
partition_snapshot_reader: rows_position and rows_iter_type supporting reverse iteration
partition_snapshot_reader: split responsibility of ck_range
partition_snapshot_reader: separate _schema into _query_schema and _partition_schema
query: reverse clustering_range
test: cql_query_test: fix test_query_limit for reversed queries
Our scylla.yaml contains a comment listing the available experimental
features, supposedly helping a user who might want to enable them.
I think the usefuless of this comment is dubious, but as long as we
have one, let's at least make it accurate:
* Two of the listed "experimental" features, cdc and lwt, have actually
graduated from being experimental long ago. Although technically a user
may still use the words "cdc" and "lwt" in the "experimental-features"
list, doing so is pointless, and worse: This comment suggests that these
two features are still experimental - while they are not!
* One experimental feature - alternator-ttl - is missing from this list.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013083247.13223-1-nyh@scylladb.com>
Equivalent to abstract_replication_strategy get_range_addresses,
yet synchronous, as it uses the precalculated map.
Call it from storage_service::get_new_source_ranges
and range_streamer::get_all_ranges_with_sources_for.
Consequently, get_new_source_ranges and removenode_add_ranges
can become synchronous too.
Unfortunately we can't entirely get rid of
abstract_replication_strategy::get_range_addresses
as it's still needed by
range_streamer::get_all_ranges_with_strict_sources_for.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is not used any more.
Methods either use the token_metadata_ptr in the
effective_replication_map, or receive an ad-hoc
token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for deleting the _shared_token_metadata member.
All we need for recognized_options is the topology
(for network_topology_strategy).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that abstract_replication_strategy methods are all async
clone_only_token_map_sync, and update_normal_tokens_sync
are unused.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is no longer in use.
And with it, the virtual calculate_natural_endpoint_sync method
of which it was the only caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Remaining callers of get_address_ranges and get_pending_address_ranges
are all either from a seastar thread or from a coroutine
so we can make the methods always async and drop the
can_yield param.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>