When destroying a permit with leaked resources we call
`on_internal_error_noexcept()` in the destructor. This method logs an
error or asserts depending on the configuration. When not asserting, we
need to return the leaked units to the semaphore, otherwise they will be
leaked for good. We can do this because we know exactly how many
resources the user of the permit leaked (never signalled).
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.
We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.
Fixes#4520Closes#7864
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.
That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.
Fixes#8345.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.
Closes#8337
This open option tells seastar that the file in question
will be truncated to the needed size right at once and all
the subsequent writes will happen within this size. This
hint turns off append optimization in seastar that's not
that cheap and helps so save few cpu cycles.
The option was introduced in seastar by 8bec57bc.
tests: unit(dev), dtest(commitlog:
test_batch_commitlog,
test_periodic_commitlog,
test_commitlog_replay_on_startup)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
Until clang figures things out with the now infamous
`-llvm -inline-threshold X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).
Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.
This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.
However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.
This was caught by one of our unit tests:
sstable_conforms_to_mutation_source_test
...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.
The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:
assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));
It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.
After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!
They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.
Fixes#8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.
For the former case -- the btree wrapper is introduced.
For the latter -- compiler optimizes-away too many important
bits and walking the tree turns into a bunch of hard-coded
hacks and reiterpret-casts. Untill better solution is found,
just print the address of the tree root.
* xemul/br-gdb-btree-rows:
gdb: Show address of the row::_cells tree (or "empty" mark)
gdb: Add support for intrusive B tree
gdb: Use helper to get rows from mutation_partition
Currently clang optimizes-out lots of critical stuff from
compact radix tree. Untill we find out the way to walk the
tree in gdb, it's better to at least show where it is in
memory.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rows inside partition are now stored in an intrusive B-tree,
so here's the helper class that wraps this collection.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).
Fix by defining it in the internal namespace.
Closes#8268
The manifest manipulation commands stopped working with podman 3;
the containers-storage: prefix now throws errors.
Switch to `buildah manifest`; since we're building with buildah,
we might as well maintain the manifest with buildah as well.
Closes#8231
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustement).
This causes clang to complain that the switch statement doesn't handle that
case.
Fix by adding a no-op case.
Closes#8269
Although a seastar::smart_ptr is trivial to dereference manually, so is
adding support for it to dereference_smart_ptr(), avoiding the annoying
(but brief) detour which is currently needed.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>
Concepts are easier to read and result in better error messages.
This change also tightens the constraint from "std::is_fundamental" to
"std::integral". The differences are floating point values, nullptr_t,
and void. The latter two are illegal/useless to write, and nobody uses
floating point values for list lengths, so everything still compiles.
Closes#8326
This series implements leader stepdown extension. See patch 4 for
justification for its existence. First three patches either implement
cleanups to existing code that future patch will touch or fix bugs
that need to be fixed in order for stepdown test to work.
* 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev:
raft: add test for leader stepdown
raft: introduce leader stepdown procedure
raft: fix replication when leader is not part of current config
raft: do not update last election time if current leader is not a part of current configuration
raft: move log limiting semaphore into the leader state
"
Scylla suffers with aggressive compaction after repair-based operation has initiated. That translates into bad latency and slowness for the operation itself.
This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, so reducing bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.
To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.
The solution takes advantage there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.
NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series.
Refs #5226.
"
* 'offstrategy_v7' of github.com:raphaelsc/scylla:
tests: Add unit test for off-strategy sstable compaction
table: Wire up off-strategy compaction on repair-based bootstrap and replace
table: extend add_sstable_and_update_cache() for off-strategy
sstables/compaction_manager: Add function to submit off-strategy work
table: Introduce off-strategy compaction on maintenance sstable set
table: change build_new_sstable_list() to accept other sstable sets
table: change non_staging_sstables() to filter out off-strategy sstables
table: Introduce maintenance sstable set
table: Wire compound sstable set
table: prepare make_reader_excluding_sstables() to work with compound sstable set
table: prepare discard_sstables() to work with compound sstable set
table: extract add_sstable() common code into a function
sstable_set: Introduce compound sstable set
reshape: STCS: preserve token contiguity when reshaping disjoint sstables
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of active configuration. Current code skips replication
in this case though. Fix it by always replicating in the leader state.
Since we use external failure detector instead of relying on empty
AppendRequests from a leader there can be a situation where a node
is no longer part of a certain raft group but is still alive (and also
may be part of other raft groups). In such case last election time
should not be updated even if the node is alive. It is the same as if
it would have stopped to send empty AppendRequests in original raft.
Since we switched scylla-python3 build directory to tools/python3/build
on Jenkins, we nolonger need compat-python3 targets, drop them.
Related scylladb/scylla-pkg#1554
Closes#8328
"
Due to bad interaction of recent changes (913d970 and 4c8ab10) inctive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that pre-admission
the permits don't forward their resource cost to the semaphore, to
prevent them possibly blocking their own admission later. However this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there is too much of
them.
Fixes: #8258
Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
"
* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
test: mutation_reader_test: add test for permit cleanup
test: querier_cache_test: add memory based cache eviction test
reader_permit: add inactive state
querier: insert(): account immediately evicted querier as resource based eviction
reader_concurrency_semaphore: fix clear_inactive_reads()
reader_concurrency_semaphore: make inactive_read_handle a weak reference
reader_concurrency_semaphore: make evict() noexcept
reader_concurrency_semaphore: update out-of-date comments
After commit 0bd201d3ca ("cql3: Skip indexed
column for CK restrictions") fixed issue #7888, the test
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
began passing, as expected. So we can remove its "xfail" label.
Refs #7888.
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>
* seastar ea5e529f30...83339edb04 (21):
> cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc)
> Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov
Fixes#8327.
> file: Add option to refuse the append-challenged file
> Merge "Teach io-tester to work on block device" from Pavel E
> Merge "Cleanup files code" from Pavel E
> install-dependencies: Support rhel-8.3
> install-dependencies: Add some missing rh packages
> file, reactor: reinstate RWF_NOWAIT support
> file: Prevent fsxattr.fsx_extsize from overflow
> cmake: enable clang's -Wno-error=#warnings if supported
> cmake: harden seastar_supports_flag aginst inputs with spaces or #
> cmake: fix seastar_supports_flag failing after first invocation
> thread: Stop backtraces in main() on s390x architecture
> intent: Explicitly declare constructors for references
> test: file_io_test: parallel_overwrite: use testing::local_random_engine
> util: log-impl: rework log_buf::inserter_iterator
> rwlock: pass timeout parameter to get_units
> concepts: require lib support to enable concepts
> rpc: print more info on bad protocol magic
> seastar-addr2line: strip input line to restore multiline support
> log: skip on unknown nested mixing instead of stopping the logging
Ref #8327.
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnMapEntriesTest.java into our
our cql-pytest framework.
This test file checks various features of indexing (with secondary index)
individual entries of maps. All these tests pass on Cassandra, but fail on
Scylla because of issue #2962 - we do not yet support indexing of the content
of unfrozen collections. The failing test currently fail as soon as they
try to create the index, with the message:
"Cannot create secondary index on non-frozen collection or UDT column v".
Refs #2962.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>
Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid use of resources, namely the number of file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything.
The mechanism is now cleaned up, and a simple exponential backoff replaced previous constant backoff policy.
Fixes#8317
Tests: unit(dev), manual(see #8317 for a simple reproducer)
Closes#8318
* github.com:scylladb/scylla:
thrift: add exponential backoff for retries
thrift: fix and simplify retry logic
When a tuple value is serialized, we go through every element type and
use it to serialize element values. But an element type can be
reversed, which is artificially different from the type of the value
being read. This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.
Fixes#7902
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8316
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns. But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key. We end up
with invalid clustering slice, which can cause problems downstream.
Fix this by skipping the indexed column when assembling the clustering
restrictions.
Tests: unit (dev)
Fixes#7888
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8320
A trichotomic comparator returning an int an easily be mistaken
for a less comparator as the return types are convertible.
Use the new std::strong_ordering instead.
A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.
Ref #1449.
Test: unit (dev)
Closes#8323
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.
Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.
Closes#8322
time_point_to_string ensures its input is a time_point with
millisecond resolution (though it neglects to verify the epoch
is what it expects). Change the test from a clunky enable_if to
a nicer concept.
Closes#8321
Currently send_snapshot is the only two-way RPC used by Raft.
However, the sender (the leader) does not look at the receiver's
reply, other than checks it's not an error. This has the following
issues:
- if the follower has a newer term and rejects the snapshot for
that reason, the leader will not learn about a newer follower
term and will not step down
- the send_snapshot message doesn't pass through a single-endpoint
fsm::step() and thus may not follow the general Raft rules
which apply for all messages.
- making a general purpose transport that simply calls fsm::step()
for every message becomes impossible.
Fix it by actually responding with snapshot_reply to send_snapshot
RPC, generating this reply in fsm::step() on the follower,
and feeding into fsm::step() on the leader.
* scylla-dev/raft-send-snapshot-v2:
raft: pass snapshot_reply into fsm::step()
raft: respond with snapshot_reply to send_snapshot RPC
raft: set follower's next_idx when switching to SNAPSHOT mode
raft: set the current leader upon getting InstallSnapshot
The original backoff mechanism which just retries after 1ms
may still lead to rapid resource depletion.
Instead, an exponential backoff is used, with a cap of ~2s.
Tests: manual, with cassandra-stress and browsing logs
The retry logic for Thrift frontend had two bugs:
1. Due to missing break in a switch statement,
two retry calls were always performed instead of one,
which acts a little bit like a Seastar forkbomb
2. The delayed action was not guarded with any gate,
so it was theoretically possible to access a captured `this`
pointer of an object which already got deallocated.
In order to fix the above, the logic is simplified to always
retry with backoff - it makes very little sense to skip the backoff
and immediate retries are not needed by anyone, while they cause
severe overload risk.
Tests: manual - a simple cassandra-stress invocation was able to crash
scylla with a segfault:
$ cassandra-stress write -mode thrift -rate threads=2000
Fixes#8317
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.
A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.
Closes#8315
* github.com:scylladb/scylla:
sstables: vector write: convert to concepts
sstables: check_truncated_and_assign: convert to concept
sstables: convert write() to concepts
sstables: convert write_vint() to concepts
sstables: vector parse(): convert to concept
sstables: convert parse() for a self-describing type to concept
sstables: read_vint(): convert enable_if to concepts
sstables: add concept for self-describing type
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.
There are three variants: integral, enum, and self-describing
(currently expressed as not integral and not enum). Convert to
concepts by using the standard concepts or the new self_describing
concept.