Commit Graph

25041 Commits

Author SHA1 Message Date
Benny Halevy
338c190842 reader_concurrency_semaphore: inactive_read_handle: mark methods noexcept
All are trivially noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113327.1027792-1-bhalevy@scylladb.com>
2021-02-04 13:57:42 +02:00
Benny Halevy
ba4b8dd6e5 sstables: row.hh: no need to include reader_concurrency_semaphore.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113413.1027893-1-bhalevy@scylladb.com>
2021-02-04 13:42:06 +02:00
Tomasz Grabiec
2e3f6a9622 tests: perf_fast_forward: Print output directory
Message-Id: <20210203180053.230627-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Tomasz Grabiec
e0ceb454c0 tests: perf_fast_forward: Print error hints to stdout
They point to lines printed to stdout, so should be aligned with them.
Message-Id: <20210203180016.230547-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Avi Kivity
fcd48adcc4 Update seastar submodule
* seastar b5b2ee53d...4c7c5c7c4 (1):
  > Merge "add support for printing backtraces on one line" from Benny

Fixes #5464.
2021-02-03 14:01:45 +02:00
Benny Halevy
ca6f5cb0bc test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
2021-02-03 12:21:20 +02:00
Pavel Solodovnikov
856b0b3a58 raft: introduce raft_gossip_failure_detector class
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.

Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
2021-02-03 10:45:18 +01:00
Tomasz Grabiec
f8ae46f294 Merge "raft: RPC module implementation" from Pavel Solodovnikov
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.

* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
  raft: introduce `raft_rpc` class
  raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
  configure.py: compile serializer.cc
2021-02-03 10:43:58 +01:00
Benny Halevy
55e3df8a72 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the first disk size.

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027
2021-02-03 11:30:18 +02:00
Avi Kivity
10606aadb5 Update tools/java submodule
* tools/java 78c8ef4f54...0187829d5e (1):
  > nodetool: alternate way to specify table name which includes a dot

Fixes #6521.
2021-02-03 11:27:33 +02:00
Botond Dénes
46b795b5fd mutation: consume(): add reverse mode
`mutation::consume()` is used by range scans to convert the immediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, and consequently support for consuming clustering
fragments in reverse order.
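
A minimal sketch of the idea, not the actual Scylla code. Only the `consume_in_reverse` name comes from the message above; `toy_consume`, the enum shape, and the use of plain integers as clustering rows are assumptions made for illustration.

```cpp
#include <vector>

// Mirrors the parameter name from the commit message; the enum shape is an assumption.
enum class consume_in_reverse { no, yes };

// Hypothetical stand-in for mutation::consume(): rows are stored in natural
// (ascending) order and fed to the consumer in the requested order.
template <typename Consumer>
void toy_consume(const std::vector<int>& rows, Consumer&& consumer, consume_in_reverse reverse) {
    if (reverse == consume_in_reverse::yes) {
        // Walk the stored rows back to front for reverse range scans.
        for (auto it = rows.rbegin(); it != rows.rend(); ++it) {
            consumer(*it);
        }
    } else {
        for (int row : rows) {
            consumer(row);
        }
    }
}
```

With the flag set to `yes`, a consumer that appends to a vector would see the rows in descending order while the stored order stays untouched.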

Fixes: #8000

Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
2021-02-03 11:00:47 +02:00
Piotr Sarna
c03363b520 README: fix a dead link for building instructions
The link was outdated, since its destination was moved to
a subdirectory.

Message-Id: <b0e0eedaea4f26acf050a91ab9eed1ca37a838bb.1612338584.git.sarna@scylladb.com>
2021-02-03 10:59:50 +02:00
Avi Kivity
913d970c64 Merge "Unify inactive readers" from Botond
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
for storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.

Tests: unit(release, debug:v1)
"

* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: store inactive readers directly
  querier_cache: store readers in the reader concurrency semaphore directly
  querier_cache: retire memory based cache eviction
  querier_cache: delegate expiry to the reader_concurrency_semaphore
  reader_concurrency_semaphore: introduce ttl for inactive reads
  querier_cache: use new eviction notify mechanism to maintain stats
  reader_concurrency_semaphore: add eviction notification facility
  reader_concurrency_semaphore: extract evict code into method evict()
2021-02-03 10:59:04 +02:00
Piotr Sarna
d395305ddd api: fix retrieving replied RPC messages
The API call referred to a nonexistent callback,
which is now renamed to better match the API path
and actually implemented.

Message-Id: <3d0dbb42f67e1584999a58da9aa9cc722487fda1.1612279443.git.sarna@scylladb.com>
2021-02-03 09:42:17 +02:00
Pekka Enberg
5670276163 Update seastar submodule
* seastar cb3aaf07...b5b2ee53 (1):
  > perftune.py: fix assignment after extend and add asserts
Fixes #8008
2021-02-02 15:27:13 +02:00
Tomasz Grabiec
873e732042 Merge "Switch partition rows onto B-tree" from Pavel Emelyanov
This is the continuation of the row-cache performance
improvements, this time -- the rework of clustering keys part.

The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree

Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:

  There's no copyable key at hand. The clustering key is a
  managed_bytes() that is not nothrow-copy-constructible, nor is
  it hashable for lookup, due to prefix lookup.

Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.

B-trees are like B+, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger than B+ ones, which
have data only in leaf nodes. Not to make the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.

Not to invalidate iterators on insert/remove the tree nodes keep
pointers on keys, not the keys themselves.

The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% fewer comparisons on a random
insert/lookup test.
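
One way to see why tri-compare can save comparisons is sketched below. It is not the B-tree code from this series: `find_with_less` and `find_with_tri` are hypothetical helpers that search a sorted vector of integers and count comparator calls. The less-based search needs one extra call at the end to tell "equal" from "greater", while the tri-compare search learns about equality from the same call that steers the descent.

```cpp
#include <cstddef>
#include <vector>

struct find_result {
    bool found;
    std::size_t comparisons;
};

// Binary search driven by a less-comparator: a lower_bound pass followed by
// one more comparison to check whether the bound is an exact match.
find_result find_with_less(const std::vector<int>& sorted, int key) {
    std::size_t n = 0;
    auto less = [&n](int a, int b) { ++n; return a < b; };
    std::size_t lo = 0, hi = sorted.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (less(sorted[mid], key)) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    bool found = lo < sorted.size() && !less(key, sorted[lo]);
    return {found, n};
}

// Binary search driven by a tri-comparator returning <0, 0 or >0:
// an exact match is detected by the same call that picks the direction.
find_result find_with_tri(const std::vector<int>& sorted, int key) {
    std::size_t n = 0;
    auto cmp = [&n](int a, int b) { ++n; return (a > b) - (a < b); };
    std::size_t lo = 0, hi = sorted.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        int c = cmp(sorted[mid], key);
        if (c == 0) {
            return {true, n};
        }
        if (c < 0) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return {false, n};
}
```

The size of the saving depends on the data and the access pattern; the ~10% figure above is the author's measurement, not something this toy reproduces.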

Numbers:

- memory_footprint: B-tree       master
  rows_entry size:  216          232

  1 row
   in-cache:        968          960     (because of dummy entry)
   in-memtable:     1006         1022

  100 rows
   in-cache:        50774        50856
   in-memtable:     50620        50918

- mutation_test:    B-tree       master
   tps.average:     891177       833896

- simple_query:     B-tree       master
   tps.median:      71807        71656
   tps.maximum:     71847        71708

* xemul/clustering-cache-over-btree-4:
  mutation_partition: Save one keys comparison
  partition_snapshot_row_cursor: Remove rows pointer
  mutation_partition: Use B-tree insertion sugar
  perf-test : Print B-tree sizes
  mutation_partition: Switch cache of rows onto B-tree
  partition_snapshot_reader: Rename cmp to less for explicity
  mutation_partition: Make insertion bullet-proof
  mutation_partition: Use tri-compare in non-set places
  flat_mutation_reader: Use clear() in destroy_current_mutation()
  rows_entry: Generalize compare
  utils: Intrusive B-tree (with tests)
  tests: Generalize bptree compaction test
  tests: Generalize bptree stress test
2021-02-02 12:26:02 +01:00
Tomasz Grabiec
75eb97b12c Merge 'Commitlog multi-entry write' from Calle Wilund
Fixes #7615

Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.

Small test included.

Closes #7616

* github.com:scylladb/scylla:
  commitlog_test: Add multi-entry write test
  commitlog: Add "add_entries" call to allow inputting N mutations
  commitlog: Make commitlog entries optionally multi-entry
  commitlog: Move entry_writer definition to cc file
2021-02-02 12:23:19 +01:00
Tomasz Grabiec
7b17969a6e Merge 'sstable: reader: preempt after every fragment' from Avi Kivity
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev, debug, release)

Fixes #7883.

Closes #7928

* github.com:scylladb/scylla:
  sstable: reader: preempt after every fragment
  clustering_range_walker: fix false discontiguity detected after a static row
2021-02-02 12:21:58 +01:00
Benny Halevy
0fecc78d88 user_function: throw on_internal_error if executed outside a seastar thread
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.

Add unit test that reproduces this scenario
and verifies the internal error exception.

Fixes #7977

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
2021-02-02 13:03:39 +02:00
Calle Wilund
720a47fe8a commitlog_test: Add multi-entry write test
2021-02-02 10:41:08 +00:00
Calle Wilund
c5f6125039 commitlog: Add "add_entries" call to allow inputting N mutations
Fixes #7615

Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to segment, or none.

Returns rp_handle vector corresponding to the call vector.
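
The all-or-none contract can be sketched as follows. This is a toy under stated assumptions, not the commitlog implementation: `toy_segment`, its byte-count capacity check, and the use of plain indices in place of `rp_handle` are all hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical segment with a fixed byte capacity.
struct toy_segment {
    std::size_t capacity;
    std::size_t used = 0;
    std::vector<std::string> entries;

    explicit toy_segment(std::size_t cap) : capacity(cap) {}

    // Adds every entry of the batch or none of them. On success the returned
    // vector has one handle (here: an index) per input entry, in input order.
    std::optional<std::vector<std::size_t>> add_entries(const std::vector<std::string>& batch) {
        std::size_t total = 0;
        for (const auto& e : batch) {
            total += e.size();
        }
        // Check that the whole batch fits before writing anything.
        if (used + total > capacity) {
            return std::nullopt;
        }
        std::vector<std::size_t> handles;
        handles.reserve(batch.size());
        for (const auto& e : batch) {
            handles.push_back(entries.size());
            entries.push_back(e);
            used += e.size();
        }
        return handles;
    }
};
```

Checking the total size up front is what keeps a rejected batch from leaving a partial write behind in this sketch.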
2021-02-02 10:41:08 +00:00
Calle Wilund
5fcc2066ed commitlog: Make commitlog entries optionally multi-entry
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.

To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the entra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.

When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
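
A rough sketch of the verify-then-deliver replay described above. The real on-disk layout, marker, and checksum function are not given in this message, so everything here is an assumption: FNV-1a stands in for the checksum, `multi_entry` and `post_word` are hypothetical names, and the header checksum is left out.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a, used only as a stand-in checksum for this sketch.
std::uint32_t toy_checksum(const std::string& data) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

struct sub_entry {
    std::string payload;
    std::uint32_t checksum;
};

// Hypothetical "parent" entry wrapping several sub-entries.
struct multi_entry {
    std::vector<sub_entry> entries;
    std::uint32_t post_word;  // checksum over the sub-entry checksums
};

// Checksums the checksums rather than the payloads again.
std::uint32_t checksum_of_checksums(const std::vector<sub_entry>& entries) {
    std::string bytes;
    for (const auto& e : entries) {
        for (int shift = 0; shift < 32; shift += 8) {
            bytes.push_back(static_cast<char>((e.checksum >> shift) & 0xffu));
        }
    }
    return toy_checksum(bytes);
}

multi_entry write_multi(const std::vector<std::string>& payloads) {
    multi_entry m;
    for (const auto& p : payloads) {
        m.entries.push_back(sub_entry{p, toy_checksum(p)});
    }
    m.post_word = checksum_of_checksums(m.entries);
    return m;
}

// Verifies every sub-entry and the post-word first; payloads are handed to
// the caller only if the whole unit is healthy.
bool replay(const multi_entry& m, std::vector<std::string>& out) {
    for (const auto& e : m.entries) {
        if (toy_checksum(e.payload) != e.checksum) {
            return false;
        }
    }
    if (checksum_of_checksums(m.entries) != m.post_word) {
        return false;
    }
    for (const auto& e : m.entries) {
        out.push_back(e.payload);
    }
    return true;
}
```

A single corrupted sub-entry makes `replay` reject the whole unit without delivering any payload, which is the behavior the message asks for.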
2021-02-02 10:41:08 +00:00
Calle Wilund
6bef3f9cc3 commitlog: Move entry_writer definition to cc file
Should not be public/visible
2021-02-02 10:32:44 +00:00
Juliusz Stasiewicz
29e4737a9b transport: Fix abort on certain configurations of native_transport_port(_ssl)
The reason was accessing the `configs` table out of index. Also,
native_transport_port-s can no longer be disabled by setting to 0,
as per the table below.

Rules for port/encryption (the same apply to shard_aware counterpart):

np  := native_transport_port.is_set()
nps := native_transport_port_ssl.is_set()
ceo := ceo.at("enabled") == "true"
eq  := native_transport_port_ssl() == native_transport_port()

+-----+-----+-----+-----+
|  np | nps | ceo |  eq |
+-----+-----+-----+-----+
|  0  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  0  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  0  |  1  |  0  |  *  |   =>   nonsense, don't listen
|  0  |  1  |  1  |  *  |   =>   listen on native_transport_port_ssl, encrypted
|  1  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  1  |  1  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  1  |  1  |  0  |   =>   listen on native_transport_port, unencrypted + native_transport_port_ssl, encrypted
|  1  |  1  |  1  |  1  |   =>   native_transport_port(_ssl), encrypted
+-----+-----+-----+-----+
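
The table can be read as a small decision function. The sketch below encodes its nine rows directly; `listener` and `plan_listeners` are hypothetical names, and the real code works on the configuration objects rather than four booleans.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one listening socket.
struct listener {
    std::string port_option;  // which config option supplies the port
    bool encrypted;
};

// np, nps, ceo and eq have the meanings defined above the table.
std::vector<listener> plan_listeners(bool np, bool nps, bool ceo, bool eq) {
    if (!nps) {
        // Rows with nps == 0: a single listener, encrypted iff ceo.
        return std::vector<listener>{listener{"native_transport_port", ceo}};
    }
    if (!np) {
        // Only the SSL port is set: it needs encryption enabled to make sense.
        if (!ceo) {
            return std::vector<listener>{};
        }
        return std::vector<listener>{listener{"native_transport_port_ssl", true}};
    }
    // Both ports are set.
    if (!ceo) {
        return std::vector<listener>{listener{"native_transport_port", false}};
    }
    if (eq) {
        // Same port number for both options: one encrypted listener.
        return std::vector<listener>{listener{"native_transport_port", true}};
    }
    return std::vector<listener>{
        listener{"native_transport_port", false},
        listener{"native_transport_port_ssl", true},
    };
}
```

For example, `plan_listeners(false, true, false, false)` yields no listeners, matching the "nonsense, don't listen" row.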

Fixes #7783
Fixes #7866

Closes #7992
2021-02-02 11:32:31 +02:00
Avi Kivity
285303b131 Update tools/jmx submodule
* tools/jmx 2c95650...949cefc (2):
  > dist/redhat: stop using systemd macros, call systemctl directly
  > Remove obsolete FIXME

See scylladb/scylla-jmx#94.
2021-02-02 11:29:36 +02:00
Takuya ASADA
7b310c591e dist/redhat: stop using systemd macros, call systemctl directly
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.

See scylladb/scylla-jmx#94

Closes #8005
2021-02-02 11:28:07 +02:00
Avi Kivity
da4fa0629a Merge "sstables: add sstable_origin to scylla_metadata" from Benny
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)

The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().

A get_origin() method was added to class sstable to retrieve
its origin.  It returns an empty string by default if the origin
is missing.

Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
it generates.  Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.

A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.

Test: unit(release)

Fixes #7880
"

* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
  compaction: log sstable origin
  sstables: scylla_metadata: add support for sstable_origin
  sstables: sstable_writer_config: add origin member
2021-02-02 10:35:11 +02:00
Pavel Emelyanov
54ddb5a70a mutation_partition: Save one keys comparison
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.

This 2nd comparison can be avoided:

- the 1st case needs a B-tree lower_bound API extension that reports
  whether the bound is a match or not.
- the 2nd one is covered with reusing tri-compare result from the
  1st comparison

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
4ccce97396 partition_snapshot_row_cursor: Remove rows pointer
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
8e7c1e049b mutation_partition: Use B-tree insertion sugar
The B-tree .insert methods accept unique pointers and release them

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
a92eb2f7a9 perf-test : Print B-tree sizes
After the switch from BST to B-tree the memory footprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from tree in rows_entry::on_evicted, there's
  a O(1) tree::iterator method for this

- rewrite rows_entry -> cache_entry transformation in the on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node is set to 12 with linear search
and a linear extension of 20 because

- experimenting with tree shows that numbers 8 through 10 keys with linear
  search show the best performance on stress tests for insert/find-s of
  keys that are memcmp-able arrays of bytes (which is an approximation of
  current clustering key compare). More keys work slower, but still better
  than any bigger value with any type of search up to 64 keys per node

- having 12 keys per nodes is the threshold at which the memory footprint
  for B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result the footprint for B-tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
165255e2bd partition_snapshot_reader: Rename cmp to less for explicity
This is a less-comparator; cmp is used to denote tri-compare in this set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
ee9e104541 mutation_partition: Make insertion bullet-proof
The bi::intrusive::set::insert-s are non-throwing, so it's safe to add
new entry like this

  auto* ne = new entry;
  set.insert(ne);

and not worry about memory leak. B-tree's insert will be throwing, so we
need some way to free the new entries in case of exception. There's already
a way for this:

   std::unique_ptr<entry> ne = std::make_unique<entry>();
   set.insert(*ne);
   ne.release();

so make every insertion into the set work this way in advance.
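
A self-contained sketch of that pattern, under assumptions: `entry`, `throwing_set`, and `try_insert` are hypothetical stand-ins (the real container is intrusive and the entries are rows), and the failure is simulated with a flag.

```cpp
#include <memory>
#include <new>
#include <vector>

// Toy entry that counts live instances so a leak would be visible.
struct entry {
    static int live;
    int key;
    explicit entry(int k) : key(k) { ++live; }
    ~entry() { --live; }
};
int entry::live = 0;

// Stand-in for a container whose insert() may throw, e.g. on allocation.
struct throwing_set {
    std::vector<entry*> items;
    bool fail_next = false;

    void insert(entry& e) {
        if (fail_next) {
            throw std::bad_alloc();
        }
        items.push_back(&e);
    }

    // The set owns the entries once they have been linked in.
    ~throwing_set() {
        for (entry* e : items) {
            delete e;
        }
    }
};

// The unique_ptr keeps ownership until insert() has succeeded; if insert()
// throws, the entry is freed when ne goes out of scope.
bool try_insert(throwing_set& set, int key) {
    std::unique_ptr<entry> ne = std::make_unique<entry>(key);
    try {
        set.insert(*ne);
    } catch (const std::bad_alloc&) {
        return false;
    }
    ne.release();  // ownership now belongs to the set
    return true;
}
```

Calling `release()` only after `insert()` returns is the whole point: there is no window in which a thrown exception leaves the entry without an owner.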

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
926f748a3d mutation_partition: Use tri-compare in non-set places
The mutation_partition::_rows will be switched to a B-tree with a
tri-comparator, so to clearly identify the places not affected by it,
switch them to tri-compare in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
bfcd6a4bb7 flat_mutation_reader: Use clear() in destroy_current_mutation()
Currently the code uses a loop of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
306c40939b rows_entry: Generalize compare
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of rows_entry tri-comparator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
2f7c03d84c utils: Intrusive B-tree (with tests)
The design of the tree goes from the row-cache needs, which are

1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible

With the above the design is

Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes
and N pointers on keys (not keys themselves). Two differences: inner nodes have
array of pointers on kids, leaf nodes keep pointer on the tree (to update left-
and rightmost tree pointers on node move).

Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.

In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().

Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.

The code was copied from the B+ tree one, then heavily reworked, the internal
algorithms turned out to differ quite significantly.

For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.

As measured by the perf_collections test the B-tree with 8 keys is faster than
the std::set, but slower than the B+tree:

            vs set        vs b+tree
   fill:     +13%           -6%
   find:     +23%          -35%

Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
fewer pointers to set up and fewer instructions to execute when linking
node with root).

v4:
- equip insertion methods with on_alloc_point() calls to catch
  potential exception guarantees violations earlier

- add unlink_leftmost_without_rebalance. The method is borrowed from
  boost intrusive set, and is added to kill two birds -- provide it,
  as it turns out to be popular, and use a bit faster step-by-step
  tree destruction than plain begin+erase loop

v3:
- introduce "inline" root node that is embedded into tree object and in
  which the 1st key is inserted. This greatly improves the 1-key-tree
  performance, which is pretty common case for rows cache

v2:
- introduce "linear" root leaf that grows on demand

  This improves the memory consumption for small trees. This linear node may
  and should over-grow the NodeSize parameter. This comes from the fact that
  there are two big per-key memory spikes on small trees -- 1-key root leaf
  and the first split, when the tree becomes 1-key root with two half-filled
  leaves. If the linear extension goes above NodeSize it can flatten even the
  2nd peak

- mitigate the keys indirection a bit

  Prefetching the keys while doing the intra-node linear scan and the nodes
  while descending the tree gives ~+5% of fill and find

- generalize stress tests for B and B+ trees

- cosmetic changes

TODO:

- fix a few inefficiencies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes that are not left-/rightmost not to carry
  unused tree pointer on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:29 +03:00
Pavel Emelyanov
6d63bdbefe tests: Generalize bptree compaction test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:59 +03:00
Pavel Emelyanov
8bdad0bb28 tests: Generalize bptree stress test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:57 +03:00
Avi Kivity
db4b9215dd sstable: reader: preempt after every fragment
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
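
The added condition can be sketched like this. It is a toy, not the sstable reader: `toy_reader_buffer` is hypothetical, fragments are plain integers, and `need_preempt` is an injected callback standing in for the reactor's preemption check.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

enum class proceed { no, yes };

// Hypothetical reader buffer with an injectable preemption check.
struct toy_reader_buffer {
    std::size_t max_fragments;
    std::function<bool()> need_preempt;
    std::vector<int> fragments;

    bool is_buffer_full() const { return fragments.size() >= max_fragments; }

    // After every pushed fragment, stop the state machine if the buffer is
    // full or if the caller should yield; otherwise keep going.
    proceed push_fragment(int fragment) {
        fragments.push_back(fragment);
        if (is_buffer_full() || need_preempt()) {
            return proceed::no;
        }
        return proceed::yes;
    }
};
```

Returning `proceed::no` here only pauses the inner state machine; in the description above it is the outer loop that actually yields.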

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev)

Fixes #7883.
2021-02-01 19:32:07 +02:00
Avi Kivity
7634a90dd2 clustering_range_walker: fix false discontiguity detected after a static row
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.

This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.

A unit test is added.

Ref #7883.
2021-02-01 19:32:07 +02:00
Pavel Solodovnikov
9d17a654a6 raft: use null_sharder for raft tables
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>
2021-02-01 18:52:04 +02:00
Gleb Natapov
382ee066bf database: drop duplicated function
The database class has two duplicated functions, keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.

Message-Id: <20210201135333.GA1403508@scylladb.com>
2021-02-01 18:52:04 +02:00
Tomasz Grabiec
eac9c1d80a Merge "raft: configuration changes with joint consensus" from Kostja
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
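
A minimal sketch of the joint quorum rule, assuming server ids are plain strings. `server_set`, `has_majority`, and `joint_quorum` are hypothetical names and not the types used in the raft module: while the joint configuration is in effect, a decision needs a majority in C_old and, separately, a majority in C_new.

```cpp
#include <cstddef>
#include <set>
#include <string>

using server_set = std::set<std::string>;

// True if strictly more than half of the servers in config have agreed.
bool has_majority(const server_set& config, const server_set& granted) {
    std::size_t votes = 0;
    for (const auto& id : config) {
        votes += granted.count(id);
    }
    return votes > config.size() / 2;
}

// Joint consensus: both the old and the new configuration must agree.
bool joint_quorum(const server_set& c_old, const server_set& c_new, const server_set& granted) {
    return has_majority(c_old, granted) && has_majority(c_new, granted);
}
```

A server that belongs to both configurations counts toward both majorities, which is why the two checks are done per configuration rather than over the union.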

Misc cleanups.

* scylla-dev/raft-config-changes-v2:
  raft: update README.md
  raft: add a simple test for configuration changes
  raft: joint consensus, wire up configuration changes in the API
  raft: joint consensus, count votes using joint config
  raft: joint consensus, wire up configuration changes in FSM
  raft: joint consensus, update progress tracker with joint configuration
  raft: joint consensus, don't store configuration in FSM
  raft: joint consensus, keep track of the last confchange index in the log
  raft: joint consensus, implement helpers in class configuration
  raft: joint consensus, use unordered_set for server_address list
  raft: joint consensus, switch configuration to joint
  raft: rename check_committed() to maybe_commit()
  raft: fix spelling and add comments
2021-02-01 18:52:04 +02:00
Benny Halevy
4b309e0829 compaction: log sstable origin
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
77328a936a sstables: scylla_metadata: add support for sstable_origin
Add new scylla_metadata_type::SSTableOrigin.
Store and retrieve an sstring in the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.

Add unit test to verify the sstable_origin extension
using both empty and a random string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
22f6023ac3 sstables: sstable_writer_config: add origin member
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)

If configure_writer is called with a nullptr, the origin
will be equal to an empty string.

Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin.  This was to reduce the
code churn in this patch and to keep the tests simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Nadav Har'El
75a4281bff cql-pytest: test the units supposed to be usable for "duration" type
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.

The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.

Refs: #8001.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
2021-02-01 11:05:10 +01:00
Avi Kivity
bb202db1ff Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself and specify the
--keep-umask option.

Fixes #6243

Closes #6244

* github.com:scylladb/scylla:
  dist/offline_redhat: fix umask error
  dist/offline_installer/redhat: support cross build
2021-01-31 18:47:27 +02:00