This adds a "help" build target, which prints out important build
targets. The printing is done in a separate shell script, becaue "ninja"
insists on print out the "command" before executing it, which makes the
help text unreadable.
install.sh supports two different ways of redirecting paths:
--root for creating a chroot-style tree, and --prefix for changing
the installed file location. Document them.
Closes#8389
std::is_fundamental isn't a good constraint since it include nullptr_t
and void. Replace with std::integral which is sufficient. Use a concept
instead of enable_if to simplify the code.
Closes#8450
Currently, scripts/refresh-submodules.sh always refreshes all
submodules, i.e., takes the latest version of all of all of them and
commits it. But sometimes, a committer only wants to refresh a specific
submodule, and doesn't want to deal with the implications of updating
a different one.
As a recent example, for issue #8230, I wanted to update the tools/java
submodule, which included a fix for sstableloader, without updating the
Seastar submodule - which contained completely irrelevant changes.
So in this patch we add the ability to override the default list of
submodules that refresh-submodules.sh uses, with one or more command
line parameters. For example:
scripts/refresh-submodules.sh tools/java
will update only tools/java.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210411151421.309483-1-nyh@scylladb.com>
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.
This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).
The fix here is to select a sligtly small maxsize that is aligned
down to a multiple of 1MB.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
There are some users of that tri_comparator which are also
converted to strong_ordering. Most of the code using those
is, in turn, already handling return values interchangeably.
The bound_view::tri_compare, which's used by the guy, is
still returning int.
tests: unit(dev)
* xemul/br-position-tri-compare:
code: Relax position_in_partition::tri_compare users
position_in_partition: Convert tri_compare to strong_ordering
test: Convert clustering_fragment_summary::tri_cmp to strong_ordering
repair: Convert repair_sync_boundary::tri_compare to strong_ordering
view: Don't expect int from position_in_partition::tri_compare
There are some pieces left doing res <=> 0 with the
res now being a strong_ordering itself. All these can
be just dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All its users are now ready to accept both - int and
the strong_ordering value, so the change is pretty
straightforward.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set puts the partition_snapshot_row_cursor on a
diet: 320 -> 224 bytes, makes use of btree API sugar
to save some CPU cycles and compacts the code.
tests: unit(dev)
* xemul/br-row-cursor-cleanups-2:
partition_snapshot_row_cursor: Rewrite row() with consume_row()
partition_snapshot_row-cursor: Add const consume_row() version
partition_snapshot_row_cursor: Add concept to .consume_row()
partition_snapshot_row_cursor: Don't carry end iterators
partition_snapshot_row_cursor: Move cells hash creation to reader
partition_snapshot_row_cursor: Move read_partition into test
partition_snapshot_row_cursor: Move is_in_latest_version inline
partition_snapshot_row_cursor: Use is_in_latest_version where appropriate
partition_snapshot_row_cursor: Less dereferences in key() method
partition_snapshot_row_cursor: Update change mark in prepare_heap
partition_snapshot_row_cursor: Clear current row when recreating
partition_snapshot_row_cursor: Use btree::lower_bound sugar
partition_snapshot_row_cursor: Factor out next() and erase_and_advance()
partition_snapshot_row_cursor: Relax vector of iterators
btree: Add operator bool()
clustering_row: Add new .apply() overload
It's the same as the existing one, but doesn't modify
anything (cursor and pointing rows_entry's) and calls
consumer with const row reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The btree's iterator can be checked to reach the tree's end
without holding the ending iterator itself. This makes the
whole p_s_r_c 20% smaller (288 bytes -> 224 bytes) since it
now keeps 4 extra iterators on-board -- inside small vectors
for heap and current_row.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now call to .row() method may create hash on row's cells.
It's counterintuitive to see a const method that transparently
changes something it points to. Since the only caller of a row()
who knows whether the hash creation is required is the cache
reader, it's better to move the call to prepare_hash() into it.
Other than making the .row() less surprising this also helps to
get rid of the whole method by the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is test-only helper, there's no
need in keeping it as a part of the API.
Another reason to move is that the method is O(number of
rows) and doesn't preempt while looping, but cursor code
users try hard not to stall the reactor. So even though
this method has a meaningful semantics within the class,
it will better be reinvented if needed in core code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is currently defined outside of the class which
gives compiler less chances to really inline it when needed.
Also, keeping this simple piece of code inline is less code
to read (and compile).
Mark the guy noexcept while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
appropriate
Checking for _current_row[0].version being 0 (or not being 0)
is better understood if done with a well named existing helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The valid cursor's key is kept on the _position as well,
but getting it from there is 1 defererence less:
_current_row -(*)-> row -> key
_position -(**)-> std::optional -> key
* iterator's -> is pointer dereference
** std::optional is designed not to be a pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The heap's iterators validity is checked with the change mark,
which is updated every time heap is recreated. Factor these
updates out and keep the mark together with the heap it protects.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cursor keeps current row in a separate vector of iterators
and reconstructs it in a dedicated method, which _expects_ that
the vector is empty on entry.
It's better to keep the logic of current row construction in one
place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When checking if the lower-bound entry matched the search
key it's possible to avoid extra comparison with the help
of the collection used to store the rows (btree).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both helpers do the same -- advance the cursor to the next row.
The latter may additionally remove the row from the uniquely
owned version.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cursor maintains a vector of iterators that correspond to
each of the versions scanned. However, only the iterator in
the latest one is really needed, so the whole vector can be
reduced down to an optional.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The btree's iterators allow for simple checking for '== tree.end()'
condition. For this check neither the tree itself, nor the ending
iterator is required. One just need to check if the _idx value is
the npos.
One additional change to make it work is required -- when removing
an entry from the inline node the _idx should be set to npos.
This change is, well, a bugfix. An iterator left with 0 in _idx is
treated as a valid one. However, the bug is non-triggerable. If such
an "invalid" iterator is compared against tree.end() the check would
return true, because the tree pointers would conside.
So this patch adds an operator bool() to btree iterator to facilitate
simpler checking if it reached the end of the collection or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The clustering_row is a wrapper over the deletable_row and
facilitates the apply-creation of the latter from some other
objects. Soon it will accept the deletable_row itself for
apply()-ing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is dangerous to print a formatted string as is, like
sslog.warn(err.c_str()) since it might hold curly braces ('{}')
and those require respective runtime args.
Instead, it should be logged as e.g. sslog.warn("{}", err.c_str()).
This will prevent issues like #8436.
Refs #8436
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-2-bhalevy@scylladb.com>
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.
Fixes#8432.
Closes#8433
On current implementation, we may re-run ntp configuration even it
already configured.
Also, the system may configured with non-default ntp client, we just
ignoring that and configure with default ntp client.
This patch minimize unnecessary re-configuration of ntp client.
It run in following order:
1. Check NTP client is already running. If it running, skip setup
2. Check NTP client is alrady installed. If it installed, use it
3. If there is non of NTP client package installed,
- if it's CentOS, install chrony
- if it's on other distributions, install systemd-timesyncd
Closes#8431
* github.com:scylladb/scylla:
scylla_ntp_setup: detect already installed ntp client
scylla_util.py: return bool value on systemd_unit.is_active()
On current implementation, we may re-run ntp configuration even it
already configured.
Also, the system may configured with non-default ntp client, we just
ignoring that and configure with default ntp client.
This patch minimize unnecessary re-configuration of ntp client.
It run in following order:
1. Check NTP client is already running. If it running, skip setup
2. Check NTP client is alrady installed. If it installed, use it
3. If there is non of NTP client package installed,
- if it's CentOS, install chrony
- if it's on other distributions, install systemd-timesyncd
Related with #8344, #8339
The series contains code cleanups and fixes for stepdown process
and quorum check code.
Note this is re-send of already posted patches lumped together for
convenience.
* scylla-dev/raft-fixes-v1:
raft: add test for check quorum on a leader
raft: fix quorum check code for joint config and non-voting members
raft: do not hang on waiting for entries on a leader that was removed from a cluster
raft: add more tracing to stepdown code
raft: use existing election_elapsed() function instead of redo the calculation
raft: test: add test case for stepdown process
raft: check that a node is still the leader after initiating stepdown process
Currently, 'if unit.is_active():' is always True since is_active()
returns result in string (active, inactive, unknown).
To avoid such scripting bug, change return value in bool.
1) start n1, n2, n3
2) decommission n3
3) remove /var/lib/scylla for n3
4) start n4 with the same ip address as n3 to replace n3
5) replace will be successful
If a node has left the ring, we should reject the replace operation.
This patch makes the check during replace operation more strict and
rejects the replace if the node has left the ring.
After the patch, we will see
ERROR 2021-04-07 08:02:14,099 [shard 0] init - Startup failed:
std::runtime_error (Cannot replace_adddress 127.0.0.3 because it has left
the ring, status=LEFT)
Fixes#8419Closes#8420
Fixes#8363Fixes#8376
Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
if we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segements.
abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.
Closes#8372
* github.com:scylladb/scylla:
commitlog: Add signalling to recycle queue iff we fail to recycle
commitlog: Fix race and edge condition in delete_segments
commitlog: coroutinize delete_segments
commitlog_test: Add test for deadlock in recycle waiter
The "most important" major changes are:
1. storage_service: simplify CDC generation management during node replace
Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.
But that's not necessary:
- if this is the timestamp of the last (current) generation, we would
obtain it from other nodes anyway (every node gossips the last known
timestamp),
- if this is the timestamp of an earlier generation, we would forget
it immediately and start gossiping the last timestamp (obtained from
other nodes).
This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.
2. tree-wide: introduce cdc::generation_id type
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
- if a piece of code doesn't care about the timestamp, it just passes
the ID around
- if it does care, it can access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
3. cdc: handle missing generation case in check_and_repair_cdc_streams
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.
I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
---
Additionally the PR does some simplifications, stylistic improvements,
removes some dead code, coroutinizes some functions, uncoroutinizes others
(due to miscompiles), adds additional logging, updates some stale comments.
Read commit messages for more details.
Closes#8283
* github.com:scylladb/scylla:
cdc: log a message when creating a new CDC generation
cdc: handle missing generation case in check_and_repair_cdc_streams
tree-wide: introduce cdc::generation_id type
tree-wide: rename "cdc streams timestamp" to "cdc generation id"
cdc: remove some functions from generation.hh
storage_service: make set_gossip_tokens a static free-function
db: system_keyspace: group cdc functions in single place
cdc: get rid of "get_local_streams_timestamp"
sys_dist_ks: update comment at quorum_if_many
storage_service: simplify CDC generation management during node replace
check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.
I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).