Currently, we pass the entire output of perftune.py when getting the CPU
mask from the script, but this may cause a parse error since the script may
also print warning messages.
To avoid that, we need to extract the CPU mask from the output.
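A minimal sketch of such an extraction, in Python since perftune.py is a Python script. It assumes the mask is printed on a line of its own as one or more comma-separated hexadecimal words; the function name and the regular expression are illustrative and are not taken from the actual patch.

```python
import re

# Assumption: the mask occupies a whole line and consists of one or more
# comma-separated hex words, e.g. "0x0000000f" or "0xffffffff,0x000000ff".
_MASK_LINE = re.compile(
    r"^\s*(0x[0-9a-fA-F]+(?:,0x[0-9a-fA-F]+)*)\s*$", re.MULTILINE
)


def extract_cpu_mask(output: str) -> str:
    """Return the CPU mask found in the script output, ignoring other lines."""
    match = _MASK_LINE.search(output)
    if match is None:
        raise ValueError("no CPU mask found in output: %r" % output)
    return match.group(1)
```

With this approach a warning line printed before the mask is simply skipped, and output without any mask produces an explicit error instead of a confusing parse failure later on.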
Fixes #10082
Closes #10107
This patch adds a reproducer for the JSON encoding in issue #9061.
The bug was already fixed (it was a Seastar bug, and Seastar was
updated in commit 5d4213e1b8), but
I verified that the test fails before that patch - and passes today.
It is useful to have such a test for regressions, as well as for
testing backports.
Unfortunately, the test isn't pretty. The test uses the toppartitions
API, which instead of having a "start" and "stop" request has a single
synchronous "start for a given duration" request, so we need to run
it with some fixed duration (we took 1 second) and, in parallel, send
one request.
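The shape of such a test can be sketched as follows. Everything here is a stand-in: `sample_for()` imitates a synchronous "sample for a given duration" call and `events` imitates server-side activity. Neither is the real toppartitions API, and the `sampling_started` event exists only to make the sketch deterministic.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

events = []                      # stand-in for activity the sampler can observe
events_lock = threading.Lock()
sampling_started = threading.Event()


def sample_for(duration_s):
    """Hypothetical synchronous call: block, then report what happened meanwhile."""
    with events_lock:
        first = len(events)
    sampling_started.set()
    time.sleep(duration_s)
    with events_lock:
        return events[first:]


def send_request(name):
    """Hypothetical request that should show up in the sampling result."""
    with events_lock:
        events.append(name)


def sample_with_one_request(duration_s=1.0):
    sampling_started.clear()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The blocking call runs in a worker thread...
        sampling = pool.submit(sample_for, duration_s)
        sampling_started.wait()
        # ...while the request is issued from the main thread.
        send_request("write")
        return sampling.result()
```

Against a real server there is usually no "sampling started" signal, so a test of this kind would likely need a short delay before sending the request instead.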
Refs #9061.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220323180855.3307931-1-nyh@scylladb.com>
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot
rely on the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.
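The rule can be illustrated with a small Python analogy. The real controllers are C++ code using Seastar sharded services; the `Service` and `Controller` classes and the `configure` callback below are hypothetical and only show the clean-up-before-rethrow pattern.

```python
class Service:
    """Stand-in for a resource that must be stopped before it is destroyed."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class Controller:
    def __init__(self, configure):
        self._configure = configure   # hypothetical setup step that may raise
        self.service = None

    def start_server(self):
        service = Service()
        service.start()
        try:
            self._configure(service)
        except Exception:
            # Nobody will call stop_server() after a failed start, so undo
            # what was already set up before propagating the error.
            service.stop()
            raise
        self.service = service

    def stop_server(self):
        if self.service is not None:
            self.service.stop()
            self.service = None
```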
This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
would result in the unintelligible:
<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.
After this patch we get the right error message:
ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])
Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.
Fixes #10025
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
In commit 964500e47a, in the middle of
a larger series, I fixed a small Alternator bug that I found while working
on that series. The bug was that the ReturnValues=ALL_NEW feature moved out
the read previous_item, which breaks operations that need previous_item,
e.g., an ADD operation. Unfortunately, we never had a regression test for
this fixed bug, so in this patch I add one.
This bug was re-discovered on an old branch by a user, at which point
I noticed that we don't have a test for it - so I want to add it now,
even though the bug itself is long gone from Scylla master.
I verified that the new test indeed fails on old versions of Scylla
before the aforementioned commit, and passes when backporting only that
commit.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327074928.3608576-1-nyh@scylladb.com>
"
There's a static global sharded<local_cache> variable in system keyspace
that keeps several bits on board that other subsystems need to get from
the system keyspace, but want to have in a future<>-less manner.
Some time ago the system_keyspace became a classical sharded<> service
that references the qctx and the local cache. This set removes the global
cache variable and makes its instances be unique_ptr's sitting on the
system keyspace instances.
The biggest obstacle on this route is the local_host_id that was cached,
but at some point was copied onto db::config to simplify getting the value
from sstables manager (there's no system keyspace at hand there at all).
So the first thing this set does is remove the cached host_id and make
all the users get it from the db::config.
(There's a BUG with the config copy of host id -- replace node doesn't
update it. This set also fixes that.)
De-globalizing the cache is the prerequisite for untangling the snitch-
-messaging-gossiper-system_keyspace knot. Currently cache is initialized
too late -- when main calls system_keyspace.start() on all shards -- but
before this time messaging should already have access to it to store
its preferred IP mappings.
tests: unit(dev), dtest.simple_boot_shutdown(dev)
"
* 'br-trade-local-hostid-for-global-cache' of https://github.com/xemul/scylla:
system_keyspace: Make set_local_host_id non-static
system_keyspace: Make load_local_host_id non-static
system_keyspace: Remove global cache instance
system_keyspace: Make it peering service
system_keyspace,snitch: Make load_dc_rack_info non-static
system_keyspace,cdc,storage_service: Make bootstrap manipulations non-static
system_keyspace: Coroutinize set_bootstrap_state
gossiper: Add system keyspace dependency
cdc_generation_service: Add system keyspace dependency
system_keyspace: Remove local host id from local cache
storage_service: Update config.host_id on replace
storage_service: Indentation fix after previous patch
storage_service: Coroutinize prepare_replacement_info()
system_distributed_keyspace: Indentation fix after previous patch
code,system_keyspace: Relax system_keyspace::load_local_host_id() usage
code,system_keyspace: Remove system_keyspace::get_local_host_id()
"
By way of having an implementation of `data_dictionary` and using that.
The schema loader only needs a database to parse cql3 statements, which
are all coordinator-side objects and hence have been largely migrated to use
the data dictionary instead.
A few hard-dependencies on replica:: objects were found and resolved:
* index::secondary_index_manager
* tombstone_gc
The former was migrated to use `data_dictionary::table` instead of
`replica::table`. This in turn requires disentangling
`replica::data_dictionary_impl` from `replica::database`, as currently
the former can only really be used by the latter.
What all of this achieves is that we no longer have to instantiate a
`replica::database` object in `tools::load_schema()`. We want to use the
standard allocator in tools, which means they cannot use LSA memory at
all. Database on the other hand creates memtable and row-cache instances
so it had to go.
Refs: #9882
Tests: unit(dev, schema_loader_test:debug,
cql-pytest/test_tools.py:debug)
"
* 'tools-schema-loader-database-impl/v2' of https://github.com/denesb/scylla:
tools/schema_loader: use own data dictionary impl
tombstone_gc: switch to using data dictionary
index/secondary_index_manager: switch to using data dictionary
replica/table: add as_data_dictionary()
replica: disentangle data_dictionary_impl from database
replica: move data_dictionary_impl into own header
In the DynamoDB API, error responses are in JSON format with specific
fields ("__type" and "message" in the x-amz-json-1.0 format currently
used). Alternator tried to be clever and build the string representation
of this JSON itself, instead of using RapidJSON. But this optimization
was a mistake - if the error message contains characters that need
escaping (such as double quotes and newlines), they weren't escaped,
and the resulting JSON was malformed. When the client library boto3
read this malformed JSON it got confused and considered the entire error
response to be a string, which resulted in an ugly error message.
The fix is easy - just build the JSON output as usual with RapidJSON
instead of trying to optimize using string operations.
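The difference can be reproduced outside Alternator with a short Python sketch. It uses Python's `json` module rather than RapidJSON, and the two helper functions are illustrative names, not Alternator code.

```python
import json


def naive_error_body(error_type: str, message: str) -> str:
    # Hand-built JSON: breaks as soon as the message needs escaping.
    return '{"__type":"' + error_type + '","message":"' + message + '"}'


def escaped_error_body(error_type: str, message: str) -> str:
    # Let the JSON library take care of quoting and escaping.
    return json.dumps({"__type": error_type, "message": message})


message = 'unknown attribute "x"\nin expression'
good = json.loads(escaped_error_body("ValidationException", message))
print(good["message"] == message)

try:
    json.loads(naive_error_body("ValidationException", message))
except json.JSONDecodeError as e:
    print("malformed JSON:", e.msg)
```

The hand-built body is only valid as long as the message happens to contain nothing that needs escaping, which is why the bug could go unnoticed.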
The patch also includes two tests reproducing this bug and checking its
fix. The first test uses boto3 and shows it got confused on the type
of error (not understanding that it is a ValidationException). The
second test bypasses boto3 and shows exactly where the bug happens -
the response is an unparsable JSON.
Fixes #10278
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327132705.3707979-1-nyh@scylladb.com>
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just the default configuration of the
scylla-server .deb package, where --log-to-stdout is 0, same as in a normal
installation.
We don't want to keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.
Fixes #10270
Closes #10280
This reverts commit 37dc31c429. There is no
reason to suppose compacting different tables concurrently on different shards
reduces space requirements, apart from non-deterministically pausing
random shards.
However, when data is badly distributed and there are many tables, it will
slow down major compaction considerably. Consider a case where there are
100 tables, each with a 2GB large partition on some shard. This extra
200GB will be compacted on just one shard. With a compaction rate of 40 MB/s,
this adds more than an hour to the process. With the existing code, these
compactions would overlap if the badly distributed data was not all in one
shard.
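The "more than an hour" estimate follows directly from the numbers above. A quick check, assuming 1 GB = 1024 MB (with 1000 MB per GB the result is about 83 minutes):

```python
tables = 100
large_partition_gb = 2
compaction_rate_mb_per_s = 40

extra_gb = tables * large_partition_gb                    # 200 GB on one shard
extra_seconds = extra_gb * 1024 / compaction_rate_mb_per_s
extra_minutes = extra_seconds / 60

print(f"{extra_gb} GB -> {extra_seconds:.0f} s (~{extra_minutes:.0f} min)")
# prints: 200 GB -> 5120 s (~85 min)
```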
It is also counter to tablets, where data is not equally distributed on
purpose.
Closes #10246
"
Cleanup compaction works by rewriting all sstables that need clean up, one at
a time.
This approach can cause bad write amplification because the output data is
being made incrementally available for regular compaction.
Cleanup is a long operation on large data sets, and while it's happening,
new data can be written to buckets, triggering regular compaction.
Cleanup fighting for resources with regular compaction is a known problem.
With cleanup adding one file at a time to buckets, regular may require multiple
rounds to compact the data in a given bucket B, producing bad writeamp.
To fix this problem, cleanup will be made bucket aware. As each compaction
strategy has its own definition of bucket, strategies will implement their
own method to retrieve cleanup jobs. The method will be implemented such that
all files in a bucket B will be cleaned up together, and on completion,
they'll be made available for regular at once.
For STCS / ICS, a bucket is a size tier.
For TWCS, a bucket is a window.
For LCS, a bucket is a level.
In this way, writeamp problem is fixed as regular won't have to perform
multiple rounds to compact the data in a given bucket. Additionally, cleanup
will now be able to deduplicate data and will become way more efficient at
garbage collecting expired data.
The space requirement shouldn't be an issue, as compacting an entire bucket
happens during regular compaction anyway.
With leveled strategy, compacting an entire level is also not a problem because
files in a level L don't overlap and therefore incremental compaction is
employed to limit the space requirement.
For the time being, only STCS cleanup was made bucket aware. The others will be
using a default method, where one file is cleaned up at a time. Making cleanup
of other strategies bucket aware is relatively easy now and will be done soon.
Refs #10097.
"
* 'cleanup-compaction-revamp/v3' of https://github.com/raphaelsc/scylla:
test: sstable_compaction_test: Add test for strategy cleanup method
compaction: STCS: Implement cleanup strategy
compaction_manager: Wire cleanup task into the strategy cleanup method
compaction_strategy: Allow strategies to define their own cleanup strategy
compaction: Introduce compaction_descriptor::sstables_size
compaction: Move decision of garbage collection from strategy to task type
This implements cleanup strategy for STCS. It will return one descriptor
for each size tier. If a given tier has more than max_threshold
elements, more than 1 job will be returned for that tier. Token
contiguity is preserved by sorting elements of a tier by token.
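The job-splitting rule described above can be sketched in Python. This is not the C++ implementation: the `SSTable` record, its `first_token` field and the function name are hypothetical, and the tiers are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    first_token: int   # hypothetical sort key standing in for token order
    size: int


def cleanup_jobs_for_tiers(tiers, max_threshold):
    """Return one job per tier, split into chunks of at most max_threshold."""
    jobs = []
    for tier in tiers:
        # Sorting by token keeps each job's inputs contiguous in token order.
        ordered = sorted(tier, key=lambda sst: sst.first_token)
        for i in range(0, len(ordered), max_threshold):
            jobs.append(ordered[i:i + max_threshold])
    return jobs
```

For example, a tier with five sstables and a max_threshold of 2 yields three jobs of sizes 2, 2 and 1, each covering a contiguous token range within the tier.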
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As the cleanup process can now be driven by the compaction strategy,
let's move cleanup into a new task type that uses the new
compaction_strategy::get_cleanup_compaction_jobs().
For the time being, all strategies are using the default method that
returns one descriptor for each sstable that needs clean up.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
And pass it to the cql3 layer when parsing statements. This allows the
schema loader to cut itself from replica::database, using a local, much
simpler database implementation. This not only makes the code much
simpler but also opens up the way to using the standard allocator in
tools. The real database uses LSA which is incompatible with the
standard allocator (in release builds that is).
The callers are system_keyspace.load_local_host_id and storage service.
The former is non-static since previous patch, the latter has its own
sys.ks. reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No users of this variable left, all the code relies on system_keyspace
"this" to get it. Respectively, the cache can be a unique_ptr<> on the
system_keyspace instance and the global sharded variable can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And remove a bunch of (_local)?_cache.invoke_on_all() calls. This
is the preparation for removing the global cache instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's snitch code that needs it. It now takes messaging service
from gossiper, so it can do the same with system keyspace. This
change removes one user of the global sys.ks. cache instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The users of get_/set_bootstrap_state and aux helpers are CDC and
storage service. Both have local system_keyspace references and can
just use them. This removes some users of global system ks. cache
and the qctx thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The gossiper reads peer features from system keyspace. Also the snitch
code needs system keyspace, and since now it gets all its dependencies
from gossiper (will be fixed some day, but not now), it will do the same
for sys.ks. Thus it's worth having an explicit gossiper->system_keyspace
dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
But only on the surface, the only internal function needing the database
(`needs_repair_before_gc()`) still gets a real database because the
replication factor cannot be obtained from the data dictionary
currently. Although this might not look like an improvement, it is
enough to avoid a `real_database()` call for tables that don't have
tombstone gc mode set to repair.
The service uses system keyspace to, e.g., manage the generation id,
thus it depends on the system_keyspace instance and deserves the
explicit reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The config.host_id value is loaded early on start, but when the
storage service prepares to join the cluster to replace a node,
it will change that value (with the host id of the target). This
change only affects the system keyspace, but not the config copy,
which is a BUG.
fixes: #10243
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is nowadays called from several places:
- API
- sys.dist.ks. (to update view building info)
- storage service prepare_to_join()
- set up in main
All of them but the last can use the db::config cached value, because
it's loaded earlier than any of them (the last one is the
loading part itself).
Once patched, the load_local_host_id() can avoid checking the cache
for that value -- it will not be there for sure.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The host id is cached on db::config object that's available in
all the places that need it. This allows removing the method in
question from the system_keyspace and not caring that anyone that
needs host_id would have to depend on system_keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it a standalone class, instead of private subclass of database.
Unfriend database and instead make wrap/unwrap methods public, so anyone
can use them.
The test runners cql-pytest/run et al. try to automatically find the
last-compiled Scylla executable, but this decision can be overridden by
the SCYLLA environment variable. If the user mistakenly sets SCYLLA to
something which is not a valid path of an executable, the result was a
long and obscure Python stack trace.
So after this patch, if SCYLLA points to something which is not an
executable, a clear error is produced immediately, directing the user
to set this variable to a correct executable.
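A check of this kind might look like the sketch below. The function name and the exact wording of the message are assumptions; only the idea of validating the path up front comes from the description above.

```python
import os
import sys


def check_scylla_executable(path: str) -> str:
    """Return path if it names an executable file, else exit with a clear error."""
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        sys.exit(f"SCYLLA={path!r} is not an executable file; "
                 "set SCYLLA to the path of a Scylla executable.")
    return path
```

Calling `sys.exit()` with a string prints that string to stderr and exits with a non-zero status, so the user sees one line instead of a stack trace.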
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220323164427.3301828-1-nyh@scylladb.com>
Today, all compaction strategies will clean up their files using the
incremental approach of one sstable being rewritten at a time.
Turns out that's not the best approach performance wise. Let's take
STCS for example. As cleanup finishes rewriting one file, the output
file is placed into the sstable set. Regular now can compact that
file with another that was already there (e.g. produced by flush after
cleanup started). Inefficient compactions like this can keep happening
as cleanup incrementally places output file into the candidate list
for regular.
This method will allow strategies to clean up their files in batches.
For example, STCS can clean up all files in smallest tiers in single
round, allowing the output data to be added at once. So next compaction
rounds can be more efficient in terms of writeamp. Another benefit is
that deduplication and GC can happen more efficiently.
The drawback is the space requirement, as we no longer compact one file
at a time. However, the impact is minimized by cleaning up the smallest
tier first. With leveled strategy for example, even though 90% of data
is in highest level, the space requirement is not a problem because
we can apply the incremental compaction on its behalf. The same applies
to ICS. With STCS, the requirement is the size of the tier being
compacted, but that's already expected by its users anyway.
For the time being, all strategies leave it unimplemented, so they still
use the old behavior where files are rewritten one at a time.
This will allow us to incrementally implement the cleanup method for
all compaction strategies.
Refs #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Based on perf_simple_query, just bashes data into CL using
normal distribution min/max data chunk size, allowing direct
freeing of segments, _but_ delayed by a normal dist as well,
to "simulate" secondary delay in data persistence.
Needs more stuff.
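One way to draw sizes and delays "from a normal distribution between a min and a max" is sketched below. How the test actually maps the min/max options onto the distribution is not stated here, so the mean and standard deviation chosen in this sketch are assumptions, as is the helper name.

```python
import random


def clamped_normal(rng: random.Random, lo: float, hi: float) -> float:
    """Draw from a normal distribution centred between lo and hi, clamped to [lo, hi]."""
    mean = (lo + hi) / 2
    stddev = (hi - lo) / 6      # assumption: +/- 3 sigma spans the range
    return min(hi, max(lo, rng.gauss(mean, stddev)))


rng = random.Random(1)
data_size = int(clamped_normal(rng, 160, 1024))    # cf. --min-data-size / --max-data-size
flush_delay_ms = clamped_normal(rng, 10, 200)      # cf. --min/--max-flush-delay-in-ms
```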
Some baseline measurements on master:
--min-flush-delay-in-ms 10 --max-flush-delay-in-ms 200
--commitlog-use-hard-size-limit true
--commitlog-total-space-in-mb 10000 --min-data-size 160 --max-data-size 1024
--smp 1
median 2065648.59 tps ( 1.1 allocs/op, 0.0 tasks/op, 1482 insns/op)
median absolute deviation: 48752.44
maximum: 2161987.06
minimum: 1984267.90
--min-data-size 256 --max-data-size 16384
median 269385.25 tps ( 2.2 allocs/op, 0.7 tasks/op, 3244 insns/op)
median absolute deviation: 15719.13
maximum: 323574.43
minimum: 228206.28
--min-data-size 4096 --max-data-size 61440
median 67734.22 tps ( 6.4 allocs/op, 2.9 tasks/op, 9153 insns/op)
median absolute deviation: 2070.93
maximum: 82833.17
minimum: 61473.57
--min-data-size 61440 --max-data-size 1843200
median 2281.37 tps ( 79.7 allocs/op, 43.5 tasks/op, 202963 insns/op)
median absolute deviation: 128.87
maximum: 3143.84
minimum: 2140.80
--min-data-size 368640 --max-data-size 6144000
median 679.76 tps (225.5 allocs/op, 116.3 tasks/op, 662700 insns/op)
median absolute deviation: 39.30
maximum: 1148.95
minimum: 586.86
Actual throughput is obviously meaningless, as it was run on my slow
machine, but IPS might be relevant.
Note that transaction throughput plummets as we increase median data
sizes above ~200k, since we then more or less always end up replacing
buffers in every call.
Closes #10230