This is already the case, since main.cc calls it from shard 0 and
relies on it to spread the information to the other shards. We will
turn this branch - which is always taken - into an assert for the
sake of future-proofing and soon add even more code that relies on this
being executed in shard 0.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are about to use the auto compaction property during the
populate/reshard process. If the user toggles it, the database can be
left in a bad state.
There should be no reason why a user would want to set that up this
early. So we'll disallow it.
To do that property, it is better if the check of whether or not
the storage service is ready to accomodate this request is local
to the storage service itself. We then move the logic of set_tables_autocompaction
from api to the storage service. The API layer now merely translates
the table names and pass it along.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
SSTables created for the upload directory should be using its custom error
handler.
There is one user of the custom error handler in tree, which is the current
upload directory function. As we will use a free function instead of a lambda
in our implementation we also use the opportunity to fix it for consistency.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
I just noticed while working on the reshape patches that there
is an extra format bracket in two of the debug message. As they
are debug I've seen them less often than the others and that slipped.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This adds support for configuring whether to run dbuild with 'docker' or
'podman' via a new environment variable, DBUILD_TOOL. While at it, check
if 'podman' exists, and prefer that by default as the tool for dbuild.
In this patch I rewrote the explanations in both README.md and HACKING.md
about Scylla's dependencies, and about dbuild.
README.md used to mention only dbuild. It now explains better (I think)
why dbuild is needed in the first place, and that the alternative is
explained in HACKING.md.
HACKING.md used to explain *only* install-dependencies.sh - and now explains
why it is needed, what install-dependencies.sh and that it ONLY works on
very recent distributions (e.g., Fedora older than 32 are not supported),
and now also mentions the alternative - dbuild.
Mentions of incorrect requirements (like "gcc > 8.1") were fixed or dropped.
Mention of the archaic 'scripts/scylla_current_repo' script, which we used
to need to install additional packages on non-Fedora systems, was dropped.
The script itself is also removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616100253.830139-1-nyh@scylladb.com>
nonwrapping_range<T> and related templates represent mathematical
intervals, and are different from C++ ranges. This causes confusion,
especially when C++ ranges and the range templates are used together.
As the first step to disentable this, introduce a new interval.hh
header with the contents of the old range.hh header, renaming as
follows:
range_bound -> interval_bound
nonwrapping_range -> nonwrapping_interval
wrapping_range -> wrapping_interval
Range -> Interval (concepts)
The range alias, which previously aliased wrapping_range, did
not get renamed - instead the interval alias now aliases
nonwrapping_interval, which is the natural interval type. I plan
to follow up making interval the template, and nonwrapping_interval
the alias (or perhaps even remove it).
To avoid churn, a new range.hh header is provided with the old names
as aliases (range, nonwrapping_range, wrapping_range, range_bound,
and Range) with the same meaning as their former selves.
Tests: unit (dev)
This series contains two improvements to hint file replay logic
in hints manager:
- During replay of a hint file, keeping track of the first hint that fails
to be sent is now done via a simple std::optional variable instead of an
unordered_set. This slightly reduces complexity of next replay position
calculation.
- A corner case is handled: if reading commitlog fails, but there won't be an
error related to sending hints, starting position wouldn't be updated. This
could cause us to replay more hints than necessary.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
* piodul-hints-manager-handle-commitlog-failure-in-replay-position-calculation:
hinted handoff: use bool instead of send_state_set
hinted handoff: update replay position on commitlog failure
hinted handoff: remove rps_set, use first_failed_rp instead
* seastar 81242ccc3f...8f0858cfd7 (18):
> Merge 'future, future-utils: stop returning a variadic future from when_all_succeed'
> file: introduce layered_file_impl, a helper for layered files
> net: packet: mark move assignment operator as noexcept
> core: weak_ptr, weakly_referencable: implement empty default constructor
> circular_buffer: Fix build with gcc 11 (avoid template parameters in d'tor declaration)
> test: weak_ptr_test: fix static asserts about nothrow constructibility
> coroutines: Fix clang build
> cmake: Delete SEASTAR_COROUTINES_TS
> Merge "future-util: Mark a few more functions as noexcept" from Rafael
> tests: add a perf test to measure the fair_queue performance
> Merge "iostream: make iostream stack nothrow move constructible" from Benny
> future: Move most of rethrow_with_nested out of line.
> future_test: Add test for nested exceptions in finally
> core: Add noexcept to unaligned members functions
> Merge "core: make weak_ptr and checked_ptr default and move nothrow constructible" from Benny
> core: file: Fix typo in a comment
> byteorder: Mark functions as noexcept
> future: replace CanInvoke concepts with std::invocable
Avi says:
"The backlog is a large number that changes slowly, so float
might not have enough resolution to track small changes.
For example, if the backlog is 800GB and changes less than 100kB, then
we won't see a change (float resolution is 2^23 ~ 1:8,000,000).
This is outside the normal range of values (usually the backlog changes
a lot more than 100kB per 15-second period), so it will work, but better
to be more careful."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200615150621.17543-1-raphaelsc@scylladb.com>
But not compaction.
When reclaiming segments to seastar non-empty segments are copied
as-is to some other place. Instead of doing this reclaimer can copy
only allocated objects and leave the freed holes behing, i.e. -- do
the regular compaction. This would be the same or better from the
timing perspective, and will help to avoid yet another compaction
pass over the same set of objects in the future.
Current migration code checks for the free segments reserve to be
above minimum to proceed with migration, so does the code after this
patch, thus the segment compaction is called with non-empty free
segments set and thus it's guaranteed not to fail the new segment
allocation (if it will be required at all).
Plus some bikeshedding patches for the run-up.
tests: unit(dev)
* https://github.com/xemul/scylla/tree/br-logalloc-compact-on-reclaim-2:
logalloc: Compact segments on reclaim instead of migration
logallog: Introduce RAII allocation lock
logalloc: Shuffle code around region::impl::compact
logalloc: Do not lock reclaimer twice
logalloc: Do not calculate object size twice
logalloc: Do not convert obj_desc to migrator back and forth
SSTable_set is now an optional, and if we don't want to expire data
it will be empty. We need to check that it is not empty before dereferencing
it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200610170647.142817-1-glauber@scylladb.com>
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).
There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:
1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.
2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.
Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:
auto prev = std::prev(it);
if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
it->set_continuous(true);
but handle_end_of_stream() is not invoked under allocating section.
Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:
return with_linearized_managed_bytes([&] {
...
return _worker_state->alloc_section(region, [&] {
Fixes#6637.
Refs #6108.
Tests:
- unit (all)
Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6551
from Juliusz Stasiewicz:
The command regenerates streams when:
generations corresponding to a gossiped timestamp cannot be
fetched from system_distributed table,
or when generation token ranges do not align with token metadata.
In such case the streams are regenerated and new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether streams needed regeneration or not.
Fixes#6498
Accompanied by: scylladb/scylla-jmx#109, scylladb/scylla-tools-java#172
Since scylla-cpupower.service isn't installed by .rpm package, but created
in the setup script, it's better to not use /usr/lib directory, use /etc.
We already doing same way for scylla-server.service.d/*.conf, *.mount, and
*.swap created by setup scripts.
In decommission operation, current code requires a node in local dc to
sync data with. This requirement is too strong for a non network topology
strategy. For example, consider:
n1 dc1
n2 dc1
n3 dc2
n2 runs decommission operation. For a keyspace with simple strategy and
RF = 2, it is possible n3 is the new owner but n3 is not in the same dc
as n2.
To fix, perform the dc check only for the network topology strategy.
Fixes#6564
"
This series Adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram a histogram for latency tracking and then used in storage_proxy as a more efficient with a higher resolution histogram.
Follow up series would use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes#5815Fixes#4746
"
* amnonh-quicker_estimated_histogram:
storage_proxy: use time_estimated_histogram for latencies
test/boost/estimated_histogram_test
utils/histogram_metrics_helper Adding histogram converter
utils/estimated_histogram: Adding approx_exponential_histogram
5ceb20c439 switched --enable-dpdk
to a tristate switch, but forgot that add_tristate() prepends
--enable and --disable itself; so now the switch looks like
--enable-enable-dpdk and --disable-enable-dpdk.
Fix by removing the "enable-" prefix.
This patch adds a helper converter function to convert from a
approx_exponential_histogram histogram to a seastar::metrics::histogram
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds an efficient histogram implementation.
The implementation chooses efficiency over flexibility.
That is why templates are used.
How the approx_exponential_histogram pseudo floating point histogram
works: It split the range [MIN, MAX] into log2(MAX/MIN) ranges it then
split each of that ranges linearly according to a given resolution.
For example, using resolution of 4, would be similar to using an
exponentially growing histogram with a coefficient of 1.2.
All values are uint64. To prevent handling of corner cases, it is not
allowed to set the MIN to be lower than the resolution.
The approx_exponential_histogram will probably not be used directly,
the first used is by time_estimated_histogram. A histogram for durations.
It should be compared to the estimated_histogram.
Performance comparison:
Comparison was done by inserting 2^20 values into
time_estimated_histogram and estimated_histogram.
In debug mode on a local machine insert operation took an average of
26.0 nanoseconds vs 342.2 nanoseconds.
In release mode insert operation took an average of 1.90 vs 8.28 nanoseconds
Fixes#5815
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The main goal of this series is to implement FilterExpression - the
newer syntax for filtering results of Query and Scan requests.
This feature itself is just one simple patch - it just needs to have the
already-existing filtering code call the already-existing expression
evaluation code. However, before we can do this, we need a patch to
refactor the expression-evaluation interface (this patch also fixes
pre-existing bugs). Then we need three additional patches to fix pre-
existing bugs in the various corner cases of expressions (this bugs
already existed in ConditionExpression but now became visible in
tests for FilerExpression). Finally, in the end of the series, we also
do a bit of code cleanup.
After this series, the FilterExpression feature is complete, and all
tests for this feature pass.
Tests: unit(dev)
* 'alternator-filterexpression' of git://github.com/nyh/scylla:
alternator: avoid unnecessary conversion to string
alternator: move some code out of executor.cc
alternator: implement FilterExpression
alternator: improve error path of attribute_type() function
alternator: fix begins_with() error path
alternator: fix corner case of contains() function in conditions
alternator: refactor resolving of references in expressions
"
This is part of the work for replacing global sstring variables with
constexpr std::string_view ones.
To have std::string_view values we have to convert a few APIs to take
std::string_view instead of sstring references.
The API conversions are complicated by the fact that
std::unordered_map doesn't support heterogeneous lookup, so we need
another hash map.
The one provided by abseil seems like a natural choice since it has an
API that looks like what is being proposed for c++
(http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1690r0.html)
but is also much faster.
A nice side effect is that this series is a 0.46% win in
perf_simple_query --duration 16 --smp 1 -m4G
Over 500 runs with randomized section layout and environment on each
run.
"
* 'espindola/absl-v10' of https://github.com/espindola/scylla:
database: Use a flat_hash_map for _ks_cf_to_uuid
database: Use flat_hash_map for _keyspaces
Add absl wrapper headers
build: Link with abseil
cofigure: Don't overwrite seastar_cflags
Add abseil as a submodule
Given that the key is a std::pair, we have to explicitly mark the hash
and eq types as transparent for heterogeneous lookup to work.
With that, pass std::string_view to a few functions that just check if
a value is in the map.
This increases the .text section by 11 KiB (0.03%).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This changes the hash map used for _keyspaces. Using a flat_hash_map
allows using std::string_view in has_keyspace thanks to the
heterogeneous lookup support.
This add 200 KiB to .text, since this is the first use of absl and
brings in files from the .a.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Using these instead of using the absl headers directly adds support
for heterogeneous lookup with sstring as key.
The is no gain from having the hash function inline, so this
implements it in a .cc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It is a pity we have to list so many libraries, but abseil doesn't
provide a .pc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The variable seastar_cflags was being used for flags passed to seastar
and for flags extracted from the seastar.pc file.
This introduces a new variable for the flags extracted from the
seastar.pc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This adds the https://abseil.io library as a submodule. The patch
series that follows needs a hash table that supports heterogeneous
lookup, and abseil has a really good hash table that supports that
(https://abseil.io/blog/20180927-swisstables).
The library is still not available in Fedora, but it is fairly easy to
use it directly from a submodule.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
Use table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.
Refs: #5942
"
* asias-repair_use_table_id_instead_of_table_name:
repair: Do not pass table names to repair_info
repair: Add table_id to row_level_repair
repair: Use table id to find a table in get_sharder_for_tables
repair: Add table_ids to repair_info
repair: Make func in tracker::run run inside a thread
This backlog metric holds the sum of backlog for all the tables
in the system. This is very useful for understanding the behavior
of the backlog trackers. That's how we managed to fix most of
backlog bugs like #6054, #6021, etc.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200612194908.39909-1-raphaelsc@scylladb.com>
"
The cql_server and thrift are "owned" by storage_service for
the sake of managing those, i.e. starting and stopping. Since
other services (still) need the storage_service this creates
dependencies loops.
This set makes the client services independent from the storage
service. As a consequence of it the auth service is also removed
from storage_service and put standalone. This, in turn, sets
some tests free from the need to start and stop auth and makes
one step towards NOT join_cluster()-ing in unit tests.
Also the set fixes few wierd races on scylla start and stop
that can trigger local_is_initialized() asserts, and one case of
unclear aborted shutdown when client services remain running
till the scylla process exit.
Yet another benefit is localization of "isolating" functionality
that sits deeper in storage_service than it should.
One thing that's not completely clean after it is the need for cql
server to continue referencing the service_memory_limiter semaphore
from the storage_service, but this will go away with one of the
next sets.
tests: unit(debug), manual start-stop,
nodetool check of cql/thrift start/stop
"
* 'br-split-transport-1' of https://github.com/xemul/scylla:
storage_service: Isolate isolator
auth: Move away from storage_service
auth: Move start-stop code into main
main: Don't forget to stop cql/thrift when start is aborted
thrift_controller: Switch on standalone
thrift_controller: Pass one through management API
thrift_controller: Move the code into thrift/
thrift_controller: Introduce own lock for management
thrift: Wrap start/stop/is_running code into a class
cql_controller: Switch on standalone
cql_controller: Pass one through management API
cql_controller: Move the code into transport/
cql_controller: Introduce own lock for management
cql: Wrap start/stop/is_running code into a class
api: Tune reg/unreg of client services control endpoints
A lot is going on when calculating effective ownership.
For each node in the cluster, we need to go over all the ranges belong
to that node and see if that node is the owner or not.
This patch uses futurized loops with do_for_each so it would preempt if
needed.
The patch replaces the current for-loops with do_for_each and do_with
but keeps the logic.
Fixes#6380
In a couple of places, where we already have a std::string_view, there
is no need to convert to to a std::string (which requires allocation).
One cool observation (by Piotr Sarna) is that map over std::string_view
is fine, when the strings in the map are always string constants.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The source file alternator/executor.cc has grown too much, reaching almost
4,000 lines. In this patch I move about 400 lines out of executor.cc:
1. Some functions related to serialization of sets and lists were moved to
serialization.cc,
2. Functions related to evaluating parsed expressions were moved to
expressions.cc.
The header file expressions_eval.hh was also removed - the calculate_value()
functions now live in expressions.cc, so we can just define them in
expressions.hh, no need for a separate header files.
This patch just moves code around. It doesn't make any functional changes.
Refs #5783.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch provides a complete implementation for the FilterExpression
parameter - the newer syntax for filtering the results of the Query or
Scan operations.
The implementation is pretty straightforward - we already added earlier
a result-filtering framework to Alternator, and used it for the older
filtering syntax - QuryFilter and ScanFilter. All we had to do now was
to run the FilterExpression (which has the same syntax as a
ConditionExpression) on each individual items. The previous cleanup
patches were important to reduce the friction of running these expressions
on the items.
After the previous patches fixing small esoteric bugs in a few expression
functions, with this patch *all* the tests in test_filter_expression.py
now pass, and so do the two FilterExpression tests in test_query.py and
test_scan.py. As far as I know (and of course minus any bugs we'll discover
later), this marks the FilterExpression feature complete.
Fixes#5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The attribute_type() function, which can be used in expressions like
ConditionExpression and FilterExpression, is supposed to generate an
error if its second parameter is not one of the known types. What we
did until now was to just report a failed check in this case.
We already had a reproducing test with FilterExpression, but in this patch
we also add a test with ConditionExpression - which fails before this
patch and passes afterwards (and of course, passes with DynamoDB).
Fixes#6641.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The begins_with() function should report an error if a constant is
passed to it which isn't one of the supported types - string or bytes
(e.g., a number).
The code we had to check this had wrong logic, though. If the item
attribute was also a number, we silently returned false, and didn't
go on to detect that the second parameter - a constant - was a number
too and should generate an error - not be silent.
Fixed and added a reproducing test case and another test to validate
my understanding of the type of parameters that begins_with() accepts.
Fixes#6640.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It turns out that the contains() functions in the new syntax of
conditions (ConditionExpression, FilterExpression) is not identical
to the CONTAINS operator in the old-syntax conditions (Expected).
In the new syntax, one can check whether *any* constant object is contained
in a list. In the old syntax, the constant object must be of specific
types.
So we need to move the testing out of the check_CONTAINS() functions
that both implementations used, and into just the implementation of
the old syntax (in conditions.cc).
This bug broke one of the FilterExpression tests, but this patch also
adds new tests for the different behaviour of ConditionExpression and
Expected - tests which also reproduce this issue and verify its fix.
Fixes#6639.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the DynamoDB API, expressions (e.g., ConditionExpression and many more)
may contain references to column names ("#name") or to values (":val")
given in a separate part of the request - ExpressionAttributeNames and
ExpressionAttributeValues respectively.
Before this patch, we resolved these references as part of the expression's
evaluation. This approach had two downsides:
1. It often misdiagnosed (both false negatives and false positives) cases
of unused names and values in expressions. We already had two xfailing
tests with examples - which pass after this patch. This patch also
adds two additional tests, which failed before this patch and pass
with it.
2. In one of the following patches we will add support for FilterExpression,
where the same expression is used repeatedly on many items. It is a waste
(as well as makes the code uglier) to resolve the same references again
and again each time the expression is evaluated. We should be able
to do it just once.
So this patch introduces an intermediate step between parsing and evaluating
an expression - "resolving" the expression. The new resolve_*() functions
modify the already parsed expression, replacing references to attribute
names and constant values by the actual names and values taken from the
request. The resolve_*() functions also keep track which references were
used, making it very easy to check (as DynamoDB does) if there are any
unused names or values, before starting the evaluation.
The interface of evaluate() functions become much simpler - they no longer
need to know the original request (which was previously needed for
ExpressionAttributeNames/Values), the table's schema (which was previously
needed only for some error checking), keep track of which references were
used. This simplification is helpful for using the expressions in contexts
where these things (request and schema) are no longer conveniently available,
namely in FilterExpression.
A small side-benefit of this patch is that it moves a bit of code, which
handled resolving of references in expressions, from executor.cc to
expressions.cc. This is just the first step in a bigger effort to
reduce the size of executor.cc by moving code to smaller source files.
There is no attempt in this patch to move as much code as we can.
We will move more code in a separate patch in this series.
Fixes#6572.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
LCS and SCTS already have their own files, reducing the clutter in
compaction_strategy.cc. Do the same for TWCS. I am doing this in
preparation to add more functions.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200611230906.409023-6-glauber@scylladb.com>
There is a code that isolates a node on disk error. After all the previous
changes this code can be collected in one place (better to move it away from
storage_service at all, but still).
This simplifies the stop_transport(): now it can avoid rescheduling itself
on shard 0 for the 2nd time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>