"
A big problem with scylla tool executables is that they include the
entire scylla codebase and thus they are just as big as the scylla
executable itself, making them impractical to deploy on production
machines. We could try to combat this by selectively including only the
actually needed dependencies, but even ignoring the huge churn of
sorting out our dependency hell (which we should do at some point anyway),
some tools may genuinely depend on most of the scylla codebase.
A better solution is to host the tool executables in the scylla
executable itself, with some way to switch between the actual main
functions to run. The tools themselves don't contain a lot of code, so
this won't cause any considerable bloat in the size of the scylla
executable itself.
This series does exactly that: it folds all the tool executables into
the scylla one, with main() switching between the actual mains it can
delegate to, based on the argv[1] command line argument. If this is a
known tool name, the respective tool's main will be invoked.
If it is "server", missing, or unrecognized, the scylla main is invoked.
Originally this series used argv[0] as the means of selecting the
main to run. That approach was abandoned in favor of the one described
above, for the following reasons:
* No launcher script, hard link, soft link or similar games are needed to
launch a specific tool.
* No packaging needed, all tools are automatically deployed.
* Explicit tool selection, no surprises after renaming scylla to
something else.
* Tools are discoverable via scylla's description.
* Follows the trend set by modern command line multi-command or multi-app
programs, like git.
Fixes: #7801
Tests: unit(dev)
"
* 'tools-in-scylla-exec-v5' of https://github.com/denesb/scylla:
main,tools,configure.py: fold tools into scylla exec
tools: prepare for inclusion in scylla's main
main: add skeleton switching code on argv[1]
main: extract scylla specific code into scylla_main()
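For illustration, a minimal sketch of the argv[1]-based dispatch described
above (the "types" tool name and the stub mains here are hypothetical, not
Scylla's actual entry points):

    #include <cstdio>
    #include <map>
    #include <string>

    // Hypothetical stand-ins for the real entry points.
    static int scylla_main(int argc, char**) {
        std::printf("scylla server main (%d args)\n", argc);
        return 0;
    }

    static int types_tool_main(int argc, char**) {
        std::printf("types tool main (%d args)\n", argc);
        return 0;
    }

    int main(int argc, char** argv) {
        static const std::map<std::string, int (*)(int, char**)> tools = {
            {"types", types_tool_main},
        };
        if (argc > 1) {
            if (auto it = tools.find(argv[1]); it != tools.end()) {
                // Shift argv so the tool sees its own name as argv[0].
                return it->second(argc - 1, argv + 1);
            }
            // argv[1] is "server" or unrecognized: fall through to scylla.
        }
        return scylla_main(argc, argv);
    }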
The split of <seastar/net/ip.hh> will be useful for reducing the build
time (ip.hh is huge and we don't need to include most of it)
Refs #1
* seastar 8d15e8e6...655078df (13):
> net: split <seastar/net/ip.hh>
> Merge "Rate-limited IO capacity management" from Pavel E
> util: closeable/stoppable: Introduce cancel()
> loop: Improve concepts to match requirements
> Merge "scoped_critical_alloc_section make conditional and volatile" from Benny
> Added variadic version of when_any
> websocket: define CryptoPP::byte for older cryptopp
> tests: fix build (when libfmt >= 8) by adding fmt::runtime()
> foreign_ptr: destroy_on: fixup indentation
> foreign_ptr: expose async destroy method
> when_all: when_all_state::wait_all move scoped_critical_alloc_section to captures
> json: json_return_type: provide copy constructor and assignment operator
> json: json_element: mark functions noexcept
Fixes #9798
If an exception in allocate_segment_ex is a (sub)type of std::system_error,
commit_error_handler might _not_ throw (doh), in which case the error
handling code would forget the current exception and return an unusable
segment.
Now only used as an exception pointer replacer.
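A self-contained reduction of the hazard, with hypothetical stand-ins
rather than the real commitlog code: if the handler swallows a
std::system_error instead of rethrowing, execution falls through and an
unusable segment is returned.

    #include <exception>
    #include <iostream>
    #include <system_error>

    struct segment { bool usable = false; };

    static void commit_error_handler(std::exception_ptr ep) {
        try {
            std::rethrow_exception(ep);
        } catch (const std::system_error&) {
            // Swallowed: the handler deems the error survivable and
            // returns instead of rethrowing.
        }
    }

    static segment allocate_segment_ex() {
        segment seg;
        try {
            throw std::system_error(std::make_error_code(std::errc::io_error));
        } catch (...) {
            commit_error_handler(std::current_exception());
            // BUG: if the handler returned, we fall through and hand
            // back `seg` as if the allocation had succeeded.
        }
        return seg;
    }

    int main() {
        std::cout << "usable: " << allocate_segment_ex().usable << '\n'; // 0
    }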
Closes #9870
test/rest_api has a "--ssl" option to use encrypted CQL. It's not clear
to me why this is useful (it doesn't actually test encryption of the
REST API!), but as long as we have such an option, it should work.
And it didn't work because of a typo - we set a "check_cql" variable to the
right function, but then forgot to use it and used run.check_cql instead
(which is just for unencrypted cql).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220102123202.1052930-1-nyh@scylladb.com>
When 'scylla fiber' calls _walk, the latter can validly return None
(see 74ffafc8a7 scylla-gdb.py: scylla fiber: add actual return
to early return). This None is not handled by the caller but is unpacked
as if it were a valid tuple.
fixes: #9860
tests: scylla-gdb(release, failure not reproduced though)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211231094311.2495-1-xemul@scylladb.com>
The capacity accounting was changed, so scylla-gdb.py should know
the new layout. On error, fall back to the current state.
tests: scylla-gdb(release, current and patched seastar)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211231073427.32453-1-xemul@scylladb.com>
The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features.
These features provide a way to organize the automatic
switch to raft-based cluster management.
The scheme is as follows:
1. Every new node declares support for raft-based cluster ops.
2. At that point, no node in the cluster can actually use
raft for cluster management, until the `SUPPORTS*` feature is enabled
(i.e. understood by every node in the cluster).
3. After the first, `SUPPORTS*` feature is enabled, the nodes
can declare support for the second, `USES*` feature, which
means that the node can actually switch to using raft-based cluster
ops.
The scheme ensures that even if some nodes are down while
transitioning to the new bootstrap mechanism, they can easily
switch to the new procedure later, without risking disruption of the
cluster.
The features are not actually wired to anything yet; this
provides a framework for the integration with the `raft_group0`
code, which is the subject of a follow-up series.
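A toy model of the two-phase scheme (this is not Scylla's gms API, just an
illustration): a feature counts as enabled once every node advertises it,
and only then do nodes start advertising the second feature.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    using feature_set = std::vector<std::string>;

    static bool enabled(const std::vector<feature_set>& nodes,
                        const std::string& f) {
        return std::all_of(nodes.begin(), nodes.end(),
            [&](const feature_set& n) {
                return std::find(n.begin(), n.end(), f) != n.end();
            });
    }

    int main() {
        // Phase 1: every (new) node declares SUPPORTS*.
        std::vector<feature_set> nodes(
            3, feature_set{"SUPPORTS_RAFT_CLUSTER_MANAGEMENT"});
        // Phase 2: once SUPPORTS* is enabled cluster-wide, nodes may start
        // advertising USES*; the actual switch happens only when USES*
        // becomes enabled in turn.
        if (enabled(nodes, "SUPPORTS_RAFT_CLUSTER_MANAGEMENT")) {
            for (auto& n : nodes) {
                n.push_back("USES_RAFT_CLUSTER_MANAGEMENT");
            }
        }
        std::cout << enabled(nodes, "USES_RAFT_CLUSTER_MANAGEMENT") << '\n';
    }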
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>
When there is no active segment, alloc_buf() calls new_buf_active() to
allocate a new one. new_buf_active() allocates memory
(e.g. a new segment), so it may cause memory reclamation, which may cause
segment compaction, which may call alloc_buf() and re-enter
new_buf_active(). The first call to new_buf_active() would then
overwrite _buf_active and cause the segment allocated during segment
compaction to be leaked.
This then causes an abort when objects from the leaked segment are freed,
because the segment is expected to be present in _closed_segments but
isn't: boost::intrusive::list::erase() will fail on the assertion that the
object being erased is linked.
Introduced in b5ca0eb2a2.
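A toy reduction of the re-entrancy pattern (hypothetical code, not the
real LSA internals), showing how the outer call's unconditional overwrite
loses track of the segment installed by the inner, re-entrant call:

    #include <cstdio>

    struct pool {
        int _buf_active = 0;       // id of the active segment, 0 = none
        int _next_id = 1;
        bool _reclaim_once = true; // simulate one reclamation
        int _lost = 0;

        int allocate_segment() {
            if (_reclaim_once) {   // allocation triggers reclamation...
                _reclaim_once = false;
                new_buf_active();  // ...which re-enters via compaction
            }
            return _next_id++;
        }

        void new_buf_active() {
            int seg = allocate_segment();
            if (_buf_active) {
                _lost++;           // the inner call's segment is about
            }                      // to become untracked
            _buf_active = seg;     // BUG: unconditional overwrite
        }
    };

    int main() {
        pool p;
        p.new_buf_active();
        std::printf("active=%d lost=%d\n", p._buf_active, p._lost); // lost=1
    }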
Fixes #9821
Fixes #9192
Fixes #9825
Fixes #9544
Fixes #9508
Refs #9573
Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com>
Add a description of how the SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT cluster
feature is used and note that only coordinators check it. The decision made
by a coordinator is immutable for the whole request and can be checked
by looking at the page_size field. If it's set to 0 or unset, then we're
handling the struct in the old way. Otherwise, the new way is used.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #9855
This change makes the row cache support reverse reads natively, so that reversing wrappers are not needed when reading from cache and the read can thus be executed efficiently, at a cost similar to that of a forward-order read.
After this, the database serves reverse reads from cache by default. Before, it had been bypassing cache by default since 703aed3277.
Refs: #1413
Tests:
- unit [dev]
- manual query with build/dev/scylla and cache tracing on
Closes #9454
* github.com:scylladb/scylla:
tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries
row_cache: partition_snapshot_row_cursor: Print more details about the current version vector
row_cache: Improve trace-level logging
config: Use cache for reversed reads by default
config: Adjust reversed_reads_auto_bypass_cache description
row_cache: Support reverse reads natively
mvcc: partition_snapshot: Support slicing range tombstones in reverse
test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition
row_cache: Log produced range tombstones
test: Make produces_range_tombstone() report ck_ranges
tests: lib: random_mutation_generator: Extract make_random_range_tombstone()
partition_snapshot_row_cursor: Support reverse iteration
utils: immutable-collection: Make movable
intrusive_btree: Make default-initialized iterator cast to false
The default for get_unlimited_query_max_result_size() is 100MB (adjustable through config), whereas query::result_memory_limiter::maximum_result_size is 1MB (hard coded, should be enough for everybody).
This limit is then used by the replica to decide when to break pages and, in the case of reversed clustering order reads, when to fail the read when the accumulated data crosses the threshold. The latter behavior stems from the fact that reversed reads had to accumulate all the data (read in forward order) before they could reverse it and return the result. Reverse reads thus need a higher limit so that they have a higher chance of succeeding.
Most readers now support reading in reverse natively; only the reversing wrappers (make_reversing_reader()) inserted on top of ka/la sstable readers need to accumulate all the data. In other cases, we could break pages sooner. This should lead to better stability (less memory usage) and performance (lower page build latency, higher read concurrency due to a smaller memory footprint).
Tests: unit(dev)
Closes #9815
* github.com:scylladb/scylla:
storage_proxy: Send page_size in the read_command
gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature
result_memory_accounter: use new max_result_size::get_page_size in check_local_limit
max_result_size: Add page_size field
Some code assumes that lowres_clock::duration is milliseconds, but public
documentation never claimed that. Harden the code against a change in the
definition by removing the assumptions.
Closes #9850
* github.com:scylladb/scylla:
loading_cache: fix mixup of std::chrono::milliseconds and lowres_clock::duration
service: storage_proxy: fix lowres_clock::duration assumption
service: misc_services: fix lowres_clock::duration assumption
gossip: fix lowres_clock::duration assumption
calculate_delay() implicitly converts a lowres_clock::duration to
std::chrono::microseconds. This fails if lowres_clock::duration has
higher resolution than microseconds.
Fix by using an explicit conversion, which always works.
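A minimal illustration of the fix, assuming a nanosecond-resolution clock
duration: implicit chrono conversions compile only when they are lossless,
while an explicit duration_cast always works.

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    int main() {
        using fine = std::chrono::duration<int64_t, std::nano>;
        fine d{1500};
        // std::chrono::microseconds us = d;  // would not compile: lossy
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(d);
        std::cout << us.count() << "us\n";    // truncates to 1us
    }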
recalculate_hitrates() is defined to return future<lowres_clock::duration>
but actually returns future<std::chrono::milliseconds>. This fails
if the two types are not the same.
Fix by returning the declared type.
The variable diff is declared as std::chrono::milliseconds
but is later used to store the difference between two
lowres_clock::time_point samples. This works now because the two
types are the same, but fails if lowres_clock::duration changes.
Remove the assumption by using lowres_clock::duration.
When the whole cluster already supports
separate_page_size_and_safety_limit,
start sending page_size in the read_command. This new value will be used
for determining the page size instead of hard_limit.
Fixes #9487
Fixes #7586
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This new feature will be used to determine whether the whole cluster
is ready to use the additional page_size field in max_result_size.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This means that when page_size is sent together with read_command, it will be
used for paged queries instead of the hard_limit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
With this new field comes a new member function called get_page_size.
This new function will be used by the result_memory_accounter to decide
when to cut a page.
The behaviour of get_page_size depends on whether the page_size field is
set, which is distinguished by it being equal to 0 or not. When
page_size is equal to 0, it is not set and hard_limit will be
returned from get_page_size. Otherwise, get_page_size will return the
page_size field.
When read_command is received from an old node, page_size will be equal
to 0 and hard_limit will be used to determine the page size. This is
consistent with the behaviour on the old nodes.
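A minimal sketch of the described semantics (the real max_result_size
carries more state than shown here):

    #include <cstdint>
    #include <iostream>

    struct max_result_size {
        uint64_t soft_limit = 0;
        uint64_t hard_limit = 0;
        uint64_t page_size = 0; // 0 means "not set" (e.g. an old node)

        uint64_t get_page_size() const {
            // Unset page_size: fall back to hard_limit, matching the
            // behaviour of old nodes.
            return page_size != 0 ? page_size : hard_limit;
        }
    };

    int main() {
        max_result_size old_node{1 << 19, 1 << 20, 0};
        max_result_size new_node{1 << 19, 1 << 20, 4096};
        std::cout << old_node.get_page_size() << ' '   // 1048576
                  << new_node.get_page_size() << '\n'; // 4096
    }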
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
As of now, 'systemd_unit.available' works OK only when the provided
unit is present.
It raises an Exception instead of returning a boolean
when the provided systemd unit is absent.
So, make it return a boolean in both cases.
Fixes https://github.com/scylladb/scylla/issues/9848
Closes #9849
Move saving features to `system.local#supported_features`
to the point after passing all remote feature checks in
the gossiper, right before joining the ring.
This makes the `system.local#supported_features` column store the
advertised feature set. Leave a comment in the definition of
the `system.local` schema to reflect that.
Since the column value is not actually used anywhere for now,
it shouldn't affect any tests or alter the existing behavior.
Later, we can optimize the gossip communication between nodes
in the cluster, removing the feature check altogether
in some cases (since the column value should now be monotonic).
* manmanson/save_adv_features_v2:
db: save supported features after passing gossip feature check
db: add `save_local_supported_features` function
Commit dcc73c5d4e introduced a semaphore
for excluding concurrent recalculations - _reserve_recalculation_guard.
Unfortunately, the two places in the code which tried to take this
guard just called get_units() - which returns a future<units>, not
units - and never waited for this future to become available.
So this patch adds the missing "co_await" needed to wait for the
units to become available.
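A sketch of the bug pattern against seastar's semaphore API (the
recalculate_* functions are hypothetical stand-ins; the guard's name is
taken from the commit message):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>

    seastar::semaphore _reserve_recalculation_guard{1};

    seastar::future<> recalculate_buggy() {
        // BUG: get_units() returns future<semaphore_units<>>; without
        // co_await the future is dropped and no units are ever held.
        auto units = seastar::get_units(_reserve_recalculation_guard, 1);
        co_return;
    }

    seastar::future<> recalculate_fixed() {
        // The fix: co_await the future; the units are then held until
        // `units` goes out of scope, excluding concurrent recalculations.
        auto units =
            co_await seastar::get_units(_reserve_recalculation_guard, 1);
        co_return;
    }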
Fixes #9770.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211214122612.1462436-1-nyh@scylladb.com>
On newer versions of systemd-coredump, coredumps are handled by
systemd-coredump@.service, which may cause a timeout while running the
systemd unit, like this:
systemd[1]: systemd-coredump@xxxx.service: Service reached runtime time limit. Stopping.
To prevent that, we need to override TimeoutStartSec=infinity.
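A drop-in override along these lines accomplishes that (the drop-in file
name is arbitrary):

    # /etc/systemd/system/systemd-coredump@.service.d/timeout.conf
    [Service]
    TimeoutStartSec=infinity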
Fixes #9837
Closes #9841
On CentOS 8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).
Fixes #9540
Closes #9782
Along the way, our flat structure for docs was changed
to categorize the documents, but service_levels.md was forward-ported
later and missed the created directory structure, so it was created
as a sole document in the top directory. Move it to where the other
similar docs live.
Message-Id: <68079d9dd511574ee32fce15fec541ca75fca1e2.1640248754.git.sarna@scylladb.com>
"
The token metadata and features should be kept on the query_processor itself,
so finally the "storage" API would look like this:
6 .query()
5 .get_max_result_size()
2 .mutate_with_triggers()
2 .cas()
1 .truncate_blocking()
The get_max_result_size() is probably also worth moving away from storage;
it seems to have nothing to do with it.
tests: unit(dev)
"
* 'br-query-processor-in-cql-statements' of https://github.com/xemul/scylla:
cql3: Generalize bounce-to-shard result creation
cql3: Get data dictionary directly from query_processor
create_keyspace_statement: Do not use proxy.shared_from_this()
cas_request: Make read_command() accept query_processor
select_statement: Replace all proxy-s with query_processor
create_|alter_table_statement: Make check_restricted_table_properties() accept query_processor
create_|alter_keyspace_statement: Make check_restricted_replication_strategy() accept query_processor
role_management_statement: Make validate_cluster_support() accept query_processor
drop_index_statement: Make lookup_indexed_table() accept query_processor
batch_|modification_statement: Make get_mutations accept query_processor
modification_statement: Replace most of proxy-s with query_processor
batch_statement: Replace most of proxy-s with query_processor
cql3: Make create_arg_types()/prepare_type() accept query_processor
cql3: Make .validate_while_executing() accept query_processor
cql3: Make execution stages carry query_processor over
cql3: Make .validate() and .check_access() accept query_processor
The current implementation of the Alternator expiration (TTL) feature
has each node scan for expired partitions in its own primary ranges.
This means that while a node is down, items in its primary ranges will
not get expired.
But we note that it doesn't have to be this way: if only a single node is
down, and RF=3, the items that node owns are still readable with QUORUM -
so these items can still be safely read and checked for expiration - and
also deleted.
This patch implements a fairly simple solution: when a node completes
scanning its own primary ranges, it also checks whether any of its *secondary*
ranges (ranges where it is the *second* replica) has its primary owner
down. For such ranges, this node will scan them as well. This secondary
scan stops if the remote node comes back up, but in that case it may
happen that both nodes will work on the same range at the same time.
The risks in that are minimal, though, and amount to wasted work and
duplicate deletion records in CDC. In the future we could avoid this by
using LWT to claim ownership on a range being scanned.
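A toy model of the range-selection rule (hypothetical types; the real
logic lives in Alternator's TTL service): a node scans its own primary
ranges, plus any secondary range whose primary owner is down.

    #include <iostream>
    #include <string>
    #include <vector>

    struct range {
        std::string primary;   // first replica
        std::string secondary; // second replica
    };

    static bool alive(const std::string& node) { return node != "n3"; }

    static std::vector<range> ranges_to_scan(const std::string& me,
                                             const std::vector<range>& all) {
        std::vector<range> out;
        for (const auto& r : all) {
            if (r.primary == me) {
                out.push_back(r);  // own primary range
            } else if (r.secondary == me && !alive(r.primary)) {
                out.push_back(r);  // secondary scan: cover a down primary
            }
        }
        return out;
    }

    int main() {
        std::vector<range> all = {{"n1", "n2"}, {"n2", "n3"}, {"n3", "n1"}};
        for (const auto& r : ranges_to_scan("n1", all)) {
            std::cout << "range of " << r.primary << " scanned by n1\n";
        }
        // prints the n1 range (own primary) and the n3 range (n3 is down,
        // and n1 is its second replica)
    }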
We have a new dtest (see a separate patch), alternator_ttl_tests.py::
TestAlternatorTTL::test_expiration_with_down_node, which reproduces this
and verifies this fix. The test starts a 5-node cluster, with 1000 items
with random tokens which are due to be expired immediately. The test
expects to see all items expiring ASAP, but when one of the five nodes
is brought down, this doesn't happen: Some of the items are not expired,
until this patch is used.
Fixes #9787
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211222131933.406148-1-nyh@scylladb.com>
Move saving features to `system.local#supported_features`
to the point after passing all remote feature checks in
the gossiper, right before joining the ring.
This makes the `system.local#supported_features` column store the
advertised feature set. Leave a comment in the definition of
the `system.local` schema to reflect that.
Since the column value is not actually used anywhere for now,
it shouldn't affect any tests or alter the existing behavior.
Later, we can optimize the gossip communication between nodes
in the cluster, removing the feature check altogether
in some cases (since the column value should now be monotonic).
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The main intention is actually to free the qp.proxy() from the
need to provide the get_stats() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patches there's a whole bunch of places that do
qp.proxy().data_dictionary()
while the data_dictionary is present on the query processor itself
and there's a public method to get one. So use it everywhere.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
prepare_schema_mutations is not a sleeping method, so there's no
point in taking a call-local shared pointer to the proxy. A plain
reference is more than enough.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the largest user of the proxy argument. Fix them all and
their callers (they all sit in the same .cc file).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This completes the batch_ and modification_statement rework.
Also touch the private batch_statement::read_command while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some internal methods that use the proxy argument. Replace
most of them with query_processor; the next patch will fix the rest --
those that interact with the batch statement.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some proxy arguments left in the batch_statement internals.
Fix most of them to be query_processors. A few remainders will come
later, as they rely on other statements being fixed first.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema_altering_statement declares this pure virtual method. This
patch changes its first argument from proxy to query_processor and
fixes whatever the compiler errors about.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The batch_, modification_ and select_ statements get the proxy from the
query processor just to push it through the execution stage. Simplify
that by pushing the query processor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>