commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last successfully allocated block (so that we won't fail
when allocating continuations).
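The allocate-in-blocks-and-free-headroom idea can be sketched roughly like this (a minimal illustration only: plain malloc stands in for the seastar allocator, and the function name and exact policy are assumptions, not the actual commitlog code):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative sketch: reserve memory in fixed 128K blocks, then free the
// most recently allocated block so that some memory remains available for
// subsequent small allocations (e.g. task/continuation objects).
constexpr size_t block_size = 128 * 1024;

std::vector<void*> reserve_blocks(size_t wanted_bytes) {
    std::vector<void*> blocks;
    size_t n = (wanted_bytes + block_size - 1) / block_size;
    for (size_t i = 0; i < n; ++i) {
        void* p = std::malloc(block_size);
        if (!p) {
            break; // out of memory: keep what we got so far
        }
        blocks.push_back(p);
    }
    // Free the last successfully allocated block, leaving headroom so that
    // allocating continuations afterwards does not fail.
    if (!blocks.empty()) {
        std::free(blocks.back());
        blocks.pop_back();
    }
    return blocks;
}
```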
Fixes #8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.
Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.
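The id -> address -> liveness flow can be sketched like this (all names here are simplified stand-ins for gms::gossiper, raft_rpc and raft::failure_detector; the real interfaces differ):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative stand-ins for the real Scylla types.
using server_id = uint64_t;
using inet_address = std::string;

struct gossiper_stub { // plays the role of gms::gossiper
    std::unordered_map<inet_address, bool> alive;
    bool is_alive(const inet_address& addr) const {
        auto it = alive.find(addr);
        return it != alive.end() && it->second;
    }
};

struct address_map_stub { // plays the role of the mapping kept by raft_rpc
    std::unordered_map<server_id, inet_address> addrs;
};

class gossip_failure_detector {
    const gossiper_stub& _gossiper;
    const address_map_stub& _map;
public:
    gossip_failure_detector(const gossiper_stub& g, const address_map_stub& m)
        : _gossiper(g), _map(m) {}
    // Translate the raft server id to an address, then ask the gossiper.
    bool is_alive(server_id id) const {
        auto it = _map.addrs.find(id);
        if (it == _map.addrs.end()) {
            return false; // unknown server: treated as dead (a policy choice)
        }
        return _gossiper.is_alive(it->second);
    }
};
```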
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.
* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
raft: introduce `raft_rpc` class
raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
configure.py: compile serializer.cc
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks. When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```
This change reverses the order and first checks that we found
enough disks before getting the first disk size.
Fixes #7971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #8027
`mutation::consume()` is used by range scans to convert the intermediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, along with support for consuming clustering
fragments in reverse order.
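The direction flag can be sketched like this (a simplified stand-in: a plain vector plays the role of the clustering fragment stream, and the names mirror, but are not taken verbatim from, the actual patch):

```cpp
#include <vector>

// Illustrative sketch of a consume() that can feed fragments to the
// consumer in natural or reverse order, controlled by a flag.
enum class consume_in_reverse { no, yes };

template <typename Fragment, typename Consumer>
void consume_fragments(const std::vector<Fragment>& frags,
                       consume_in_reverse reverse, Consumer&& c) {
    if (reverse == consume_in_reverse::no) {
        for (auto it = frags.begin(); it != frags.end(); ++it) {
            c(*it); // natural clustering order
        }
    } else {
        // Reverse scans must see the fragments in reverse clustering order.
        for (auto it = frags.rbegin(); it != frags.rend(); ++it) {
            c(*it);
        }
    }
}
```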
Fixes: #8000
Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive readers solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.
Tests: unit(release, debug:v1)
"
* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: store inactive readers directly
querier_cache: store readers in the reader concurrency semaphore directly
querier_cache: retire memory based cache eviction
querier_cache: delegate expiry to the reader_concurrency_semaphore
reader_concurrency_semaphore: introduce ttl for inactive reads
querier_cache: use new eviction notify mechanism to maintain stats
reader_concurrency_semaphore: add eviction notification facility
reader_concurrency_semaphore: extract evict code into method evict()
This is the continuation of the row-cache performance
improvements, this time -- the rework of the clustering keys part.
The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree
Unlike partitions, this cache has one big feature that makes it
impossible to just use the existing B+ tree:
there's no copyable key at hand. The clustering key is a
managed_bytes() that is not nothrow-copy-constructible, nor is it
hashable for lookup, due to prefix lookups.
Thus the choice is the B-tree, which is also an N-ary tree, but
doesn't copy keys around.
B-trees are like B+ trees, but can have key:data pairs in inner nodes,
so those nodes may be significantly bigger than B+ ones, which keep
data only in leaf nodes. To avoid making the memory footprint
worse, the tree assumes that keys and data live in the same object
(the rows_entry one), and the tree itself manages only the key
pointers.
To avoid invalidating iterators on insert/remove, the tree nodes keep
pointers to keys, not the keys themselves.
The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% fewer comparisons on a random
insert/lookup test.
Numbers:
- memory_footprint:     B-tree   master
    rows_entry size:    216      232
    1 row
      in-cache:         968      960   (because of dummy entry)
      in-memtable:      1006     1022
    100 rows
      in-cache:         50774    50856
      in-memtable:      50620    50918
- mutation_test:        B-tree   master
    tps.average:        891177   833896
- simple_query:         B-tree   master
    tps.median:         71807    71656
    tps.maximum:        71847    71708
* xemul/clustering-cache-over-btree-4:
mutation_partition: Save one keys comparison
partition_snapshot_row_cursor: Remove rows pointer
mutation_partition: Use B-tree insertion sugar
perf-test : Print B-tree sizes
mutation_partition: Switch cache of rows onto B-tree
partition_snapshot_reader: Rename cmp to less for explicity
mutation_partition: Make insertion bullet-proof
mutation_partition: Use tri-compare in non-set places
flat_mutation_reader: Use clear() in destroy_current_mutation()
rows_entry: Generalize compare
utils: Intrusive B-tree (with tests)
tests: Generalize bptree compaction test
tests: Generalize bptree stress test
Fixes #7615
Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.
Small test included.
Closes #7616
* github.com:scylladb/scylla:
commitlog_test: Add multi-entry write test
commitlog: Add "add_entries" call to allow inputting N mutations
commitlog: Make commitlog entries optionally multi-entry
commitlog: Move entry_writer definition to cc file
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
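The pause condition can be illustrated with a toy state machine (is_buffer_full() and need_preempt() here are stand-ins for the sstables/seastar primitives; none of this is the actual reader code):

```cpp
#include <cstddef>

// Illustrative sketch of the pause decision described above.
enum class proceed { yes, no };

struct reader_state {
    size_t buffered = 0;
    size_t max_buffer = 4;
    bool preempt_requested = false;

    bool is_buffer_full() const { return buffered >= max_buffer; }
    bool need_preempt() const { return preempt_requested; }

    // Called after each fragment is pushed: pause the state machine both
    // when the buffer fills up and when the scheduler asks us to yield,
    // so the outer loop can return to the reactor.
    proceed after_push() {
        ++buffered;
        if (is_buffer_full() || need_preempt()) {
            return proceed::no;
        }
        return proceed::yes;
    }
};
```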
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.
Unlike the previous attempt, push_ready_fragments() is not touched.
The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
Test: unit (dev, debug, release)
Fixes #7883.
Closes #7928
* github.com:scylladb/scylla:
sstable: reader: preempt after every fragment
clustering_range_walker: fix false discontiguity detected after a static row
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.
Add a unit test that reproduces this scenario
and verifies the internal error exception.
Fixes #7977
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
Fixes #7615
Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to the segment, or none.
Returns an rp_handle vector corresponding to the call vector.
Allows writing more than one blob of data using a single
"add" call into a segment. The old call sites will still
just provide a single entry.
To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker which, if present, means
the entry contains sub-entries totalling "size" bytes.
We checksum the extra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.
When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to the caller.
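The layout described above can be sketched as follows (an illustration only: the magic value and the checksum function are placeholders, and the real on-disk format and CRC differ in detail):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr uint32_t multi_entry_magic = 0xdeadbeef; // placeholder value

static uint32_t checksum(const std::vector<uint8_t>& data) {
    return std::accumulate(data.begin(), data.end(), 0u); // stand-in for real CRC
}

static void put_u32(std::vector<uint8_t>& out, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(uint8_t(v >> (8 * i)));
    }
}

static uint32_t get_u32(const std::vector<uint8_t>& in, size_t pos) {
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= uint32_t(in[pos + i]) << (8 * i);
    }
    return v;
}

// "Parent" entry: [magic][total size][len|data|crc per sub-entry...][post-word]
std::vector<uint8_t> write_multi_entry(const std::vector<std::vector<uint8_t>>& entries) {
    std::vector<uint8_t> out;
    put_u32(out, multi_entry_magic);
    put_u32(out, 0); // total size, patched below
    std::vector<uint8_t> crcs;
    for (const auto& e : entries) {
        put_u32(out, uint32_t(e.size()));
        out.insert(out.end(), e.begin(), e.end());
        uint32_t c = checksum(e); // per-sub-entry checksum
        put_u32(out, c);
        put_u32(crcs, c);
    }
    put_u32(out, checksum(crcs)); // post-word: checksum of the sub-entry checksums
    uint32_t total = uint32_t(out.size());
    for (int i = 0; i < 4; ++i) {
        out[4 + i] = uint8_t(total >> (8 * i));
    }
    return out;
}

// Replay side: read all sub-entries, verify each checksum and the post-word,
// and only then would the data be handed to the caller.
bool verify_multi_entry(const std::vector<uint8_t>& buf) {
    if (buf.size() < 12 || get_u32(buf, 0) != multi_entry_magic
            || get_u32(buf, 4) != buf.size()) {
        return false;
    }
    std::vector<uint8_t> crcs;
    size_t pos = 8;
    size_t end = buf.size() - 4; // the post-word lives at the tail
    while (pos < end) {
        if (pos + 4 > end) return false;
        uint32_t len = get_u32(buf, pos);
        pos += 4;
        if (pos + len + 4 > end) return false;
        std::vector<uint8_t> data(buf.begin() + pos, buf.begin() + pos + len);
        pos += len;
        uint32_t stored = get_u32(buf, pos);
        pos += 4;
        if (checksum(data) != stored) return false;
        put_u32(crcs, stored);
    }
    return checksum(crcs) == get_u32(buf, end);
}
```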
The Fedora version of the systemd macros does not work correctly on CentOS 7,
since CentOS 7 does not support the "file trigger" feature.
To fix the issue we need to stop using the systemd macros and call systemctl
directly.
See scylladb/scylla-jmx#94
Closes #8005
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)
The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().
A get_origin() method was added to class sstable to retrieve
its origin. It returns an empty string by default if the origin
is missing.
Compaction now logs the sstable origin for each sstable it
compacts, and it sets the sstable origin for all sstables
it generates. Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.
A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.
Test: unit(release)
Fixes #7880
"
* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
compaction: log sstable origin
sstables: scylla_metadata: add support for sstable_origin
sstables: sstable_writer_config: add origin member
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.
This 2nd comparison can be avoided:
- the 1st case needs a B-tree lower_bound API extension that reports
whether the bound is a match or not.
- the 2nd one is covered by reusing the tri-compare result from the
1st comparison
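The proposed lower_bound extension can be sketched over a sorted vector standing in for the B-tree (the function name and return shape here are assumptions, not the actual API):

```cpp
#include <utility>
#include <vector>

// Illustrative sketch: a lower_bound that also reports whether the bound is
// an exact match, so the caller can reuse the tri-compare result instead of
// doing a second comparison afterwards.
template <typename T, typename TriCompare>
std::pair<typename std::vector<T>::const_iterator, bool>
lower_bound_with_match(const std::vector<T>& v, const T& key, TriCompare cmp) {
    auto lo = v.begin(), hi = v.end();
    bool match = false;
    while (lo != hi) {
        auto mid = lo + (hi - lo) / 2;
        auto r = cmp(*mid, key); // <0, 0 or >0: one comparison per step
        if (r < 0) {
            lo = mid + 1;
        } else {
            if (r == 0) {
                match = true; // remember the hit, no extra comparison later
            }
            hi = mid;
        }
    }
    return {lo, match};
}
```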
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the switch from BST to B-tree the memory footprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The switch is pretty straightforward, and consists of
- change less-compare into tri-compare
- rename insert/insert_check into insert_before_hint
- use tree::key_grabber in mutation_partition::apply_monotonically to
exception-safely transfer a row from one tree to another
- explicitly erase the row from the tree in rows_entry::on_evicted; there's
an O(1) tree::iterator method for this
- rewrite the rows_entry -> cache_entry transformation in the on_evicted to
fit the B-tree API
- include the B-tree's external memory usage into stats
That's it. The number of keys per node is set to 12 with linear search
and a linear extension of 20 because
- experimenting with tree shows that numbers 8 through 10 keys with linear
search show the best performance on stress tests for insert/find-s of
keys that are memcmp-able arrays of bytes (which is an approximation of
the current clustering key compare). More keys work slower, but still better
than any bigger value with any type of search up to 64 keys per node
- having 12 keys per node is the threshold at which the memory footprint
for B-tree becomes smaller than for boost::intrusive::set for partitions
with 32+ keys
- 20 keys for linear root eats the first-split peak and still performs
well in linear search
As a result the footprint for the B-tree is bigger than the one for the BST
only for trees filled with 21...32 keys, by 0.1...0.7 bytes per key.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The boost::intrusive::set::insert-s are non-throwing, so it's safe to add
a new entry like this

    auto* ne = new entry;
    set.insert(*ne);

and not worry about a memory leak. The B-tree's insert will be throwing, so we
need some way to free the new entries in case of an exception. There's already
a way for this:

    std::unique_ptr<entry> ne = std::make_unique<entry>();
    set.insert(*ne);
    ne.release();

so make every insertion into the set work this way in advance.
The mutation_partition::_rows will be switched to a B-tree with a
tri-comparator, so to clearly identify the places not affected by it,
switch them to tri-compare in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code uses a loop of unlink_leftmost_without_rebalance
calls. The B-tree does have it, but plain clearing of the tree is a bit
faster with clear().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of the rows_entry tri-comparator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The design of the tree stems from the row-cache needs, which are
1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible
With the above the design is
Two types of nodes -- inner and leaf. Both types keep a pointer to the parent
node and N pointers to keys (not the keys themselves). Two differences: inner
nodes have an array of pointers to kids, while leaf nodes keep a pointer to the
tree (to update the left- and rightmost tree pointers on node move).
Nodes do not keep pointers/references to trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with a big keys-per-node
value this won't result in too many steps.
In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
two are for constant-time begin() and end().
Keys are managed by the user with the help of an embeddable member_hook instance,
which is 1 pointer in size.
The code was copied from the B+ tree one, then heavily reworked, as the
internal algorithms turned out to differ quite significantly.
For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helper wrapper
that allows doing this move while respecting the exception-safety requirement.
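The key_grabber idea can be illustrated with std::set and C++17 node handles standing in for the B-tree (the real key_grabber has a different interface; this only shows the exception-safety pattern: the element removed from the source stays owned while "in flight", so a throwing insert cannot leak it):

```cpp
#include <set>

// Illustrative sketch: the grabber owns the element extracted from the
// source tree; if it is never inserted into the destination (e.g. because
// the insertion threw), the destructor returns it to the source.
class key_grabber {
    std::set<int>& _src;
    std::set<int>::node_type _node; // owns the element while "in flight"
public:
    key_grabber(std::set<int>& src, int key)
        : _src(src), _node(src.extract(key)) {}

    void insert_into(std::set<int>& dst) {
        dst.insert(std::move(_node)); // on success the node is consumed
    }

    ~key_grabber() {
        if (!_node.empty()) {
            _src.insert(std::move(_node)); // roll back if never consumed
        }
    }
};
```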
As measured by the perf_collections test, the B-tree with 8 keys is faster than
the std::set, but slower than the B+ tree:
            vs set   vs b+tree
    fill:    +13%      -6%
    find:    +23%     -35%
Another neat thing is that 1-key insertion-removal is ~40% faster than
for the BST (the same number of allocations, but the key object is smaller,
with fewer pointers to set up and fewer instructions to execute when linking
the node with the root).
v4:
- equip insertion methods with on_alloc_point() calls to catch
potential exception-guarantee violations earlier
- add unlink_leftmost_without_rebalance. The method is borrowed from
boost intrusive set, and is added to kill two birds -- provide it,
as it turns out to be popular, and use a bit faster step-by-step
tree destruction than plain begin+erase loop
v3:
- introduce "inline" root node that is embedded into tree object and in
which the 1st key is inserted. This greatly improves the 1-key-tree
performance, which is a pretty common case for the rows cache
v2:
- introduce "linear" root leaf that grows on demand
This improves the memory consumption for small trees. This linear node may
and should over-grow the NodeSize parameter. This comes from the fact that
there are two big per-key memory spikes on small trees -- 1-key root leaf
and the first split, when the tree becomes 1-key root with two half-filled
leaves. If the linear extension goes above NodeSize it can flatten even the
2nd peak
- mitigate the keys indirection a bit
Prefetching the keys while doing the intra-node linear scan, and the nodes
while descending the tree, gives ~+5% on fill and find
- generalize stress tests for B and B+ trees
- cosmetic changes
TODO:
- fix a few inefficiencies in the core code (it walks the sub-tree twice sometimes)
- try to optimize the leaf nodes that are not left-/rightmost not to carry
an unused tree pointer on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.
Unlike the previous attempt, push_ready_fragments() is not touched.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
Test: unit (dev)
Fixes #7883.
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.
This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.
A unit test is added.
Ref #7883.
The database class has two duplicated functions, keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.
Message-Id: <20210201135333.GA1403508@scylladb.com>
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
Misc cleanups.
* scylla-dev/raft-config-changes-v2:
raft: update README.md
raft: add a simple test for configuration changes
raft: joint consensus, wire up configuration changes in the API
raft: joint consensus, count votes using joint config
raft: joint consensus, wire up configuration changes in FSM
raft: joint consensus, update progress tracker with joint configuration
raft: joint consensus, don't store configuration in FSM
raft: joint consensus, keep track of the last confchange index in the log
raft: joint consensus, implement helpers in class configuration
raft: joint consensus, use unordered_set for server_address list
raft: joint consensus, switch configuration to joint
raft: rename check_committed() to maybe_commit()
raft: fix spelling and add comments
Add a new scylla_metadata_type::SSTableOrigin.
Store and retrieve an sstring in the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.
Add a unit test to verify the sstable_origin extension
using both an empty string and a random string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin. This was done to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.
The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.
Refs: #8001.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
Since the makeself script changes the current umask, scylla_setup causes
a "scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself and specify the
--keep-umask option.
Fixes #6243
Closes #6244
* github.com:scylladb/scylla:
dist/offline_redhat: fix umask error
dist/offline_installer/redhat: support cross build