Check if a node is in NORMAL or SHUTDOWN status, which means the node
is part of the token ring from the gossip point of view and either
operates in normal status or was in normal status but has shut down.
Refs #8013
This new method is preferred over all() for iteration purposes, because
all() may have to copy sstables into a temporary.
For example, the all() implementation of the upcoming compound_sstable_set
will have no choice but to merge all sstables from N managed sets into
a temporary.
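As a hedged sketch (the new method's name is not given here;
for_each_sstable below is only an illustrative guess), the difference
in iteration style could look like this:

    // all(): may have to materialize a temporary list of all sstables.
    for (auto sstables = set.all(); auto& sst : *sstables) {
        process(sst);
    }
    // New method: visit each sstable in place, no temporary list needed.
    set.for_each_sstable([] (const sstables::shared_sstable& sst) {
        process(sst);
    });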
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>
"
Currently the sstable reader code is scattered across several source
files as follows (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
byte stream;
This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.
This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader; all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.
This patchset only moves code around, no logical changes are made.
Tests: unit(dev)
"
* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
sstables: get rid of mp_row_consumer.{hh,cc}
sstables: get rid of row.hh
sstables/mp_row_consumer.hh: remove unused struct new_mutation
sstables: move mx specific context and consumer to mx/reader.cc
sstables: move kl specific context and consumer to kl/reader.cc
sstables: mv partition.cc sstable_mutation_reader.hh
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
Move all the mx format specific context and consumer code to
mx/reader.cc and add a factory function `mx::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the mx
specific context and consumer.
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by tests is moved to
kl/reader_impl.hh, while code that can be hidden is moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
The sstable reader currently knows the definition of all the different
consumers and contexts. But it doesn't really need to, as it is a
template. Exploit this and prepare for an organization scheme where the
consumers and contexts live hidden in a cc file which includes and
instantiates the sstable reader template. As a first step expose
`sstable_mutation_reader` in a header.
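A minimal, simplified analog of the scheme (these are not the real
ScyllaDB types): the generic reader template lives in a header, while
the format specific context and consumer stay hidden in a .cc behind a
factory function.

    #include <utility>

    // sstable_mutation_reader.hh: the generic reader template, exposed.
    template <typename Context, typename Consumer>
    class sstable_mutation_reader {
        Context _context;
        Consumer _consumer;
    public:
        sstable_mutation_reader(Context ctx, Consumer c)
            : _context(std::move(ctx)), _consumer(std::move(c)) {}
        // ... generic reading logic, independent of the sstable format ...
    };

    // mx/reader.cc: format specific types are private to this file; only
    // the factory is declared in mx/reader.hh.
    namespace mx {
    struct context { /* parses the raw mx byte stream */ };
    struct consumer { /* builds mutation fragments from mx cells */ };

    sstable_mutation_reader<context, consumer> make_reader() {
        return {context{}, consumer{}};
    }
    } // namespace mx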
- 3 nodes in the cluster with rf = 3
- run repair on node1 with ignore_nodes to ignore node2 and node3
- node1 has no followers to repair with
However, currently node1 will walk through the repair procedure, reading
data from disk and calculating hashes, which is unnecessary.
This patch fixes this issue, so that in case there are no followers, we
skip the range and avoid the unnecessary work.
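A hedged sketch of the early exit (the names and types below are
stand-ins for illustration; the real repair code differs):

    #include <string>
    #include <vector>

    struct token_range {};
    std::vector<std::string> neighbors_for(const token_range& range,
            const std::vector<std::string>& ignore_nodes);
    void do_repair_range(const token_range& range,
            const std::vector<std::string>& followers);

    void repair_range(const token_range& range,
                      const std::vector<std::string>& ignore_nodes) {
        auto followers = neighbors_for(range, ignore_nodes);
        if (followers.empty()) {
            // No followers to repair with: skip the range, avoiding the
            // disk reads and hash computation entirely.
            return;
        }
        do_repair_range(range, followers);
    }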
Before:
$ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"
repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats:
repair_reason=repair, keyspace=myks3, tables={standard1},
ranges_nr=769, sub_ranges_nr=769, round_nr=1456,
round_nr_fast_path_already_synced=1456,
round_nr_fast_path_same_combined_hashes=0,
round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.0.1, 2822972}},
row_from_disk_nr={{127.0.0.1, 6218}},
row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s,
row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
Data was read from disk.
After:
$ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"
repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats:
repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769,
sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={},
row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s,
row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
No data was read from disk.
Fixes #8256
Closes #8257
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause. This will allow us to eventually drop the `restrictions` hierarchy altogether.
Tests: unit (dev, debug)
Closes #8227
* github.com:scylladb/scylla:
cql3: Make get_clustering_bounds() use expressions
cql3/expr: Add is_multi_column()
cql3/expr: Add more operators to needs_filtering
cql3: Replace CK-bound mode with comparison_order
cql3/expr: Make to_range globally visible
cql3: Gather slice-defining WHERE expressions
cql3: Add statement_restrictions::_where
test: Add unit tests for get_clustering_bounds
Use expressions instead of _clustering_columns_restrictions. This is
a step towards replacing the entire restrictions class hierarchy with
expressions.
Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Omitting these operators didn't cause bugs, because needs_filtering()
is never invoked on them. But that will likely change in the future,
so add them now to prevent problems down the road.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Instead of defining this enum in multi_column_restriction::slice, put
it in the expr namespace and add it to binary_operator. We will need
it when we switch bounds calculation from multi_column_restriction to
expr classes.
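A hedged sketch of the resulting shape (simplified, not the verbatim
definitions):

    namespace cql3::expr {

    // How a multi-column comparison such as (c1, c2) > (0, 1) is ordered.
    enum class comparison_order : char {
        cql,        // regular CQL comparison order
        clustering, // clustering order (previously the CK-bound mode)
    };

    struct binary_operator {
        // lhs, operator and rhs members omitted for brevity...
        comparison_order order;
    };

    } // namespace cql3::expr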
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
It will be used in statement_restrictions for calculating clustering
bounds. And it will come in handy elsewhere in the future, I'm sure.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add statement_restrictions::_clustering_prefix_restrictions and fill
it with relevant expressions. Explain how to find all such
expressions in the WHERE clause.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Fixes #8212
Some snapshotting operations are invoked on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter" to the
check routine which, if non-empty, lists the tables to check.
The use case is "scrub", which calls in with a limited set of tables
to snapshot.
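A hypothetical sketch of the filter semantics only (the real check
routine is asynchronous and operates on table objects, not strings):

    #include <algorithm>
    #include <string>
    #include <vector>

    bool table_in_scope(const std::string& table,
                        const std::vector<std::string>& filter) {
        // Empty filter: check every table. Non-empty filter: restrict
        // the snapshot-existence check to the listed tables.
        return filter.empty() ||
               std::find(filter.begin(), filter.end(), table) != filter.end();
    }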
Closes #8240
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().
This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.
This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk),
so that in the write() path chunks are normally
backed by 128KB allocations.
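The arithmetic, as an illustrative sketch (bytes_ostream's real chunk
layout may differ, but the sizes match the failure in #7950):

    #include <cstddef>

    struct chunk { chunk* next; unsigned size; /* data follows */ }; // 16 bytes
    constexpr size_t max_alloc_size = 128 * 1024; // 131072

    // Before: max_chunk_size() bytes of data needed an allocation of
    // 128K + sizeof(chunk) = 131088 bytes. After: cap the data size so
    // that header + data fit into a single 128KB allocation.
    constexpr size_t max_chunk_size = max_alloc_size - sizeof(chunk);

    static_assert(max_chunk_size + sizeof(chunk) == max_alloc_size);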
Added a unit test, test_large_placeholder, that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.
Refs #7950
Refs #8081
Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
On environments where the hard limit for coredumps is set to zero, the
coredump test script will fail since the system does not generate a
coredump. To avoid this issue, set ulimit -c 0 before generating the
SEGV in the script.
Note that scylla-server.service can generate a coredump even with
ulimit -c 0, because we set LimitCORE=infinity in its systemd unit file.
Fixes #8238
Closes #8245
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
sstable_set: move all() implementation into sstable_set_impl
sstable_set: preparatory work to change sstable_set::all() api
sstables: remove bag_sstable_set
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.
Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
The main motivation behind this is that by moving the all() impl into
sstable_set_impl, sstable_set no longer needs to maintain a list
of all sstables, which in turn may disagree with the respective
sstable_set_impl.
This will be very important for compound_sstable_set_impl which
will be built from existing sets, and will implement all() by
combining the all() of its managed sets.
Without this patch, we'd have to insert the same sstable into both the
compound set and the set managed by it, to guarantee that all() of the
compound set would return the correct data, which would be expensive
and error-prone.
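A hedged sketch (types heavily simplified; plain ints stand in for
shared_sstable) of why the compound set's all() yields a temporary:

    #include <memory>
    #include <unordered_set>
    #include <vector>

    using sstable_list = std::unordered_set<int>;

    struct sstable_set_impl {
        virtual ~sstable_set_impl() = default;
        // all() lives in the impl, so each impl answers from its own state.
        virtual std::shared_ptr<const sstable_list> all() const = 0;
    };

    struct compound_sstable_set_impl : sstable_set_impl {
        std::vector<std::shared_ptr<sstable_set_impl>> managed;

        std::shared_ptr<const sstable_list> all() const override {
            // Merging N managed sets has no choice but to build a temporary.
            auto ret = std::make_shared<sstable_list>();
            for (auto& set : managed) {
                auto ssts = set->all();
                ret->insert(ssts->begin(), ssts->end());
            }
            return ret;
        }
    };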
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so they can iterate through the list assuming
that it stays alive all the way through.
This will change in the future, though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.
So even range-based loops on all() have to keep a ref to the returned
list, to prevent the list from being prematurely destroyed.
So the following code
for (auto& sst : *sstable_set.all()) { ... }
becomes
for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierarchy.
So this patchset folds it into its only real user: the sstable reader.
"
* 'data_consume_context_bye' of https://github.com/denesb/scylla:
sstable: move data_consume_* factory methods to row.hh
sstables: fold data_consume_context into its users
sstables: partition.cc: remove data_consume_* forward declarations
The indentation level is significantly reduced, and so is the number
of allocations. The function signature is changed from taking an rvalue
ref to taking the unique_ptr by value, because otherwise the coroutine
captures the request as a reference, which results in use-after-free.
Tests: unit(dev)
Closes #8249
* github.com:scylladb/scylla:
alternator: drop read_content_and_verify_signature
alternator: coroutinize handle_api_request
The indentation level is significantly reduced, and so is the number
of allocations.
The function signature is changed from taking an rvalue ref to taking
the unique_ptr by value, because otherwise the coroutine captures
the request as a reference, which results in use-after-free.
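A minimal sketch of the pitfall, assuming Seastar coroutines (the
handler and request names below are illustrative):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <memory>

    struct request { /* ... */ };
    seastar::future<> verify(request&);

    // DANGEROUS: the coroutine frame stores only the reference; the
    // caller's temporary unique_ptr may be destroyed at the first
    // suspension point, freeing the request while the coroutine runs.
    seastar::future<> handle_bad(std::unique_ptr<request>&& req) {
        co_await verify(*req); // potential use-after-free
    }

    // SAFE: a by-value parameter is moved into the coroutine frame, so
    // the request lives exactly as long as the coroutine does.
    seastar::future<> handle_good(std::unique_ptr<request> req) {
        co_await verify(*req);
    }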
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more than mere forwarding can be folded into its single
real user: `sstable_reader`.
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then process it. We can't avoid reading the entire request up-front -
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This is always a bad idea. We should use a
non-contiguous buffer, and that's the goal of this patch.
We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading the content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.
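A hedged sketch of the reading loop (assumes Seastar; the helper's
name is illustrative):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <vector>

    using chunked_content = std::vector<seastar::temporary_buffer<char>>;

    seastar::future<chunked_content>
    read_content(seastar::input_stream<char> in) {
        chunked_content content;
        for (;;) {
            auto buf = co_await in.read(); // one chunk at a time
            if (buf.empty()) {             // end of the request body
                co_return content;
            }
            content.push_back(std::move(buf));
        }
    }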
Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing has completed. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.
Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON
parsing code, which needs to cross chunk boundaries, works as expected.
Fixes#7213.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
The test populates the cache, then invalidates it, then tries to push
huge (10x the segment size) chunks into seastar memory, hoping that
the invalid entries will be evicted. The exit condition on the last
stage is that the total memory of the region (used plus free) becomes
less than the size of one chunk.
However, the condition is wrong, because the cache usually contains a
dummy entry that's not necessarily on the LRU, and on some test
iteration it may happen that
evictable size < chunk size < evictable size + dummy size
In this case the test fails with bad_alloc, being unable to evict the
memory from under the dummy.
fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
seed from the issue + 200 times more with random seeds)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
Change the test/raft directory to the Boost test type.
Run each replication_test case as its own test.
The RAFT_TEST_CASE macro creates 2 test cases; the second, with random
20% packet loss, is named name_drops.
The directory test/raft is changed to host Boost tests instead of unit
tests. While there, improve the documentation.
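A hedged sketch of such a macro's shape (illustrative, not the
committed definition; replication_test is assumed to take the scenario
plus a packet-drop flag):

    #include <seastar/testing/thread_test_case.hh>

    // Generates <name> and <name>_drops; the latter runs the same body
    // with 20% random packet loss enabled.
    #define RAFT_TEST_CASE(name, ...) \
        SEASTAR_THREAD_TEST_CASE(name) { \
            replication_test(__VA_ARGS__, /* packet_drops = */ false); \
        } \
        SEASTAR_THREAD_TEST_CASE(name ## _drops) { \
            replication_test(__VA_ARGS__, /* packet_drops = */ true); \
        }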
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In raft the UUID 0 is a special case, so server ids start at 1.
Add two helper functions: one converting a local 0-based id to a raft
1-based UUID, and one converting back from UUID to raft_id.
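A hypothetical sketch of the pair (the real helpers work with the
UUID-based raft::server_id, not a bare integer):

    #include <cstddef>
    #include <cstdint>

    struct raft_id { uint64_t id; }; // stand-in for the UUID-based id

    // Local test ids are 0-based; raft ids are 1-based (0 is reserved).
    inline raft_id to_raft_id(size_t local_id) { return raft_id{local_id + 1}; }
    inline size_t to_local_id(raft_id rid) { return rid.id - 1; }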
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Change the global map of disconnected servers to a more intuitive
class, connected. The class is callable for the most common case:
connected(id).
Methods connect(), disconnect(), and all() are provided for readability
instead of directly calling map methods (insert, erase, clear). They
also support both numerical (0-based) and server_id (UUID, 1-based) ids.
The actual shared map is kept in a lw_shared_ptr.
The class is passed around by copy construction, which is practically
just creating a new lw_shared_ptr.
Internally it tracks disconnected servers, but externally it's more
intuitive to use connect instead of disconnect. So it reads
"connected id" rather than "not disconnected id", without double
negatives.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
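A hedged sketch of the class (simplified: std::shared_ptr in place of
lw_shared_ptr, plain 0-based ids only):

    #include <cstddef>
    #include <memory>
    #include <unordered_set>

    class connected {
        // Internally tracks *disconnected* servers; absence means connected.
        std::shared_ptr<std::unordered_set<size_t>> _disconnected =
            std::make_shared<std::unordered_set<size_t>>();
    public:
        // The most common case: connected(id).
        bool operator()(size_t id) const { return !_disconnected->contains(id); }
        void connect(size_t id) { _disconnected->erase(id); }
        void disconnect(size_t id) { _disconnected->insert(id); }
        void all() { _disconnected->clear(); } // reconnect everyone
        // Copy construction shares the map, so every copy observes the
        // same connectivity state.
    };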