Found by a fragment stream validator added to the mutation-compactor (https://github.com/scylladb/scylladb/pull/11532). As that PR moves very slowly, the fixes for the issues found are split out into a PR of their own.
The first two of these issues seems benign, but it is important to remember that how benign an invalid fragment stream is depends entirely on the consumer of said stream. The present consumer of said streams may swallow the invalid stream without problem now but any future change may cause it to enter into a corrupt state.
The last one is a non-benign problem (again because the consumer reacts badly already) causing problems when building query results for range scans.
Closes#11604
* github.com:scylladb/scylladb:
shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
readers/mutation_readers: compacting_reader: remember injected partition-end
db/view: view_builder::execute(): only inject partition-start if needed
We have added the finished percentage for repair based node operations.
This patch adds the finished percentage for node ops using the old
streaming.
Example output:
scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000
In addition to the metrics, log shows the percentage is added.
[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646
Fixes#11600Closes#11601
Commit 8ab57aa added a yield to the buffer-copy loop, which means that
the copy can yield before done and the multishard reader might see the
half-copied buffer and consider the reader done (because
`_end_of_stream` is already set) resulting in the dropping the remaining
part of the buffer and in an invalid stream if the last copied fragment
wasn't a partition-end.
Fixes: #11561
Currently injecting a partition-end doesn't update
`_last_uncompacted_kind`, which will allow for a subsequent
`next_partition()` call to trigger injecting a partition-end, leading to
an invalid mutation fragment stream (partition-end after partition-end).
Fix by changing `_last_uncompacted_kind` to `partition_end` when
injecting a partition-end, making subsequent injection attempts noop.
Fixes: #11608
When resuming a build-step, the view builder injects the partition-start
fragment of the last processed partition, to bring the consumer
(compactor) into the correct state before it starts to consume the
remainder of the partition content. This results in an invalid fragment
stream when the partition was actually over or there is nothing left for
the build step. Make the inject conditional on when the reader contains
more data for the partition.
Fixes: #11607
Update several aspects of the alternator/getting-started.md which were
not up-to-date:
* When the documented was written, Alternator was moving quickly so we
recommended running a nightly version. This is no longer the case, so
we should recommend running the latest stable build.
* The link to the download link is no longer helpful for getting Docker
instructions (it shows some generic download options). Instead point to
our dockerhub page.
* Replace mentions of "Scylla" by the new official name, "ScyllaDB".
* Miscelleneous copy-edits.
Fixes#11218
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11605
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).
In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.
I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.
Fixes#11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783Closes#11384
* github.com:scylladb/scylladb:
test/alternator: insert test names into Scylla logs
rest api: add a new /system/log operation
alternator ttl: log warning if scan took too long.
alternator,ttl: allow sub-second TTL scanning period, for tests
test/alternator: skip fewer Alternator TTL tests
test/alternator: test Alternator TTL metrics
The mutation fragment stream validator filter has a detailed debug log
in its constructor. To avoid putting together this message when the log
level is above debug, it is enclosed in an if, activated when log level
is debug or trace... at least that was intended. Actually the if is
activated when the log level is debug or above (info, warn or error) but
is only actually logged if the log level is exactly debug. Fix the logic
to work as intended.
Closes#11603
The token_metadata::calculate_pending_ranges_for_bootstrap() makes a
clone of itself and adds bootstrapping nodes to the clone to calculate
ranges. Currently added nodes lack the dc/rack which affects the
calculations the bad way.
Unfortunately, the dc/rack for those nodes is not available on topology
(yet) and needs pretty heavy patching to have. Fortunately, the only
caller of this method has gossiper at hand to provide the dc/rack from.
fixes: #11531
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11596
applier_fiber could create multiple snapshots between
io_fiber run. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.
In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.
_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.
The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.
Closes#11530
For some reason, the test is currently flaky on Jenkins. Apparently the
Python driver does not reconnect to the cluster after the cluster
restarts (well it does, but then it disconnects from one of the nodes
and never reconnects again). This causes the test to hang on "waiting
until driver reconnects to every server" until it times out.
Disable it for now so it doesn't block next promotion.
Fix https://github.com/scylladb/scylladb/issues/11373
- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.
Closes#11417
* github.com:scylladb/scylladb:
doc: add a comment to remove the note in version 5.1
doc: update the information on the Countng all rows page and add the recommendation to upgrade ScyllaDB
doc: add a note to the description of COUNT with a reference to the KB article
doc: add COUNT to the list of acceptable selectors of the SELECT statement
Tools want to be as little disrupting to the environment they run in as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator. To achieve this they need a seastar option to to instruct seastar to *not* configure and use the seastar allocator and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.
Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.
Closes#11510
* github.com:scylladb/scylladb:
test/boost: add alternative variant of logalloc test
tools: use standard allocator
utils/logalloc: add use_standard_allocator_segment_pool_backend()
utils/logalloc: introduce segment store backend for standard allocator
utils/logalloc: rebase release segment-store on segment-store-backend
utils/logalloc: introduce segment_store_backend
utils/logalloc: push segment alloc/dealloc to segment_store
test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Fix https://github.com/scylladb/scylladb/issues/11376
This PR adds the upgrade guide from version 5.0 to 5.1. It involves adding new files (5.0-to-5.1) and language/formatting improvements to the existing content (shared by several upgrade guides).
Closes#11577
* github.com:scylladb/scylladb:
doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1
doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1
This PR adds the missing upgrade guides for upgrading the ScyllaDB image to a patch release:
- ScyllaDB 5.0: /upgrade/upgrade-opensource/upgrade-guide-from-5.x.y-to-5.x.z/upgrade-guide-from-5.x.y-to-5.x.z-image/
- ScyllaDB Enterprise: /upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image/ (the file name is wrong and will be fixed with another PR)
In addition, the section regarding the recommended upgrade procedure has been improved.
Fixes https://github.com/scylladb/scylladb/issues/11450
Fixes https://github.com/scylladb/scylladb/issues/11452Closes#11460
* github.com:scylladb/scylladb:
doc: update the commands to upgrade the ScyllaDB image
doc: fix the filename in the index to resolve the warnings and fix the link
doc: apply feedback by adding she step fo load the new repo and fixing the links
doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
doc: update the Enterprise guide to include the Enterprise-onlyimage file
doc: update the image files
doc: split the upgrade-image file to separate files for Open Source and Enterprise
doc: clarify the alternative upgrade procedures for the ScyllaDB image
doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z
It points to a private scylladb repo, which has no place in user-facing
documentation. For now there is no public replacement, but a similar
functionality is in the works for Scylla Manager.
Fixes: #11573Closes#11580
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes#11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11576
Changes done to avoid pitfalls and fix issues of sstable-related unit tests
Closes#11578
* github.com:scylladb/scylladb:
test: Make fake sstables implicitly belong to current shard
test: Make it clearer that sstables::test::set_values() modify data size
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.
It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.
The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
Closes#11563
* github.com:scylladb/scylladb:
test/topology_raft_disabled: write basic raft upgrade test
test: setup logging in topology suites
Fake SSTables will be implicitly owned by the shard that created them,
allowing them to be called on procedures that assert the SSTables
are owned by the current shard, like the table's one that rebuilds
the sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By adding a param with default value, we make it clear in the interface
that the procedure modifies sstable data size.
It can happen one calls this function without noticing it overrides
the data size previously set using a different function.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.
It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.
The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
Make it possible to use logging from within tests in the topology
suites. The tests are executed using `pytest`, which uses a `pytest.ini`
file for logging configuration.
Also cleanup the `pytest.ini` files a bit.
In compatibility.md where we refer to the missing ability to add a GSI
to an existing table - let's refer to a new issue specifically about this
feature, instead of the old bigger issue about UpdateItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11568
when using ":attrs" attribute' from Nadav Har'El
This PR improves the testing for issue #5009 and fixes most of it (but
not all - see below). Issue #5009 is about what happens when a user
tries to use the name `:attrs` for an attribute - while Alternator uses
a map column with that name to hold all the schema-less attributes of an
item. The tests we had for this issue were partial, and missed the
worst cases which could result in Scylla crashing on specially-crafted
PutItem or UpdateItem requests.
What the tests missed were the cases that `:attrs` is used as a
**non-key**. So in this PR we add additional tests for this case,
several of them fail or even crash Scylla, and then we fix all these
cases.
Issue #5009 remains open because using `:attrs` as the name of a **key**
is still not allowed. But because it results in a clean error message
when attempting to create a table with such a key, I consider this
remaining problem very minor.
Refs #5009.
Closes#11572
* github.com:scylladb/scylladb:
alternator: fix crashes an errors when using ":attrs" attribute
alternator: improve tests for reserved attribute name ":attrs"
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.
The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".
This fix makes the relevant tests, which used to crash or fail, now pass.
This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.
Refs #5009
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As explained in issue #5009, Alternator currently forbids the special
attribute name ":attrs", whereas DynamoDB allows any string of approriate
length (including the specific string ":attrs") to be used.
We had only a partial test for this incompatibility, and this patch
improves the testing of this issue. In particular, we were missing a
test for the case that the name ":attrs" was used for a non-key
attribute (we only tested the case it was used as a sort key).
It turns out that Alternator crashes on the new test, when the test tries
to write to a non-key attribute called ":attrs", so we needed to mark
the new test with "skip". Moreover, it turns out that different code paths
handle the attribute name ":attrs" differently, and also crash or fail
in other ways - so we added more than one xfailing and skipped tests
that each fails in a different place (and also a few tests that do pass).
As usual, the new tests we checked to pass on DynamoDB.
Refs #5009
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This tiny series fixes some small error and out-of-date information in Alternator documentation and code comments.
Closes#11547
* github.com:scylladb/scylladb:
alternator ttl: comment fixes
docs/alternator: fix mention of old alternator-test directory
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.
This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.
Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
the borrower is then responsible for cleaning up the object.
Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.
Closes#11558
Which intializes LSA with use_standard_allocator_segment_pool_backend()
running the logalloc_test suite on the standard allocator segment pool
backend. To avoid duplicating the test code, the new test-file pulls in
the test code via #include. I'm not proud of it, but it works and we
test LSA with both the debug and standard memory segment stores without
duplicating code.
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.
This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.
Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not avaiable either.
The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are groupped into several
"indices" and socket is created for it. Thus, the "gossiping" index that
includes non-gossiper verbs will create topology-less socket for all
verbs in it. Worse -- raft sends messages w/o solicited topology, the
corresponding socket is created with the assumption that the peer lives
in default dc and rack which doesn't matchthe local nodes' dc/rack and
the whole index group gets the "randomly" configured socket.
Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"
* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
messaging_service: Fix gossiper verb group
messaging_service: Mind the absence of topology data when creating sockets
messaging_service: Templatize and rename remove_rpc_client_one
Use the new seastar option to instruct seastar to not initialize and use
the seastar allocator, relying on the standard allocator instead.
Configure LSA with the standard allocator based segment store backend:
* scylla-types reserves 1MB for LSA -- in theory nothing here should use
LSA, but just in case...
* scylla-sstable reserves 100MB for LSA, to avoid excessive trashing in
the sstable index caches.
With this, tools now should allocate memory on demand, without reserving
a large chunk of (or all of) the available memory, as regular seastar
apps do.
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For this latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
There's a bunch of helpers for CDC gen service in db/system_keyspace.cc. All are static and use global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods via it, thus reducing the global qctx usage.
Closes#11557
* github.com:scylladb/scylladb:
system_keyspace: De-static get_cdc_generation_id()
system_keyspace: De-static cdc_is_rewritten()
system_keyspace: De-static cdc_set_rewritten()
system_keyspace: De-static update_cdc_generation_id()
- Raise on response not HTTP 200 for `.get_text()` helper
- Fix API paths
- Close and start a fresh driver when restarting a server and it's the only server in the cluster
- Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions)
Closes#11496
* github.com:scylladb/scylladb:
test.py: handle duplicate result from driver
test.py: log server restarts for topology tests
test.py: log actions for topology tests
Revert "test.py: restart stopped servers before...
test.py: ManagerClient API fix return text
test.py: ManagerClient raise on HTTP != 200
test.py: ManagerClient fix paths to updated resource
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend which is now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
on -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for debug build,
we want something more similar to the seastar standard memory allocator
based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
Said test creates two vectors, the vector storage being allocated with
the default allocator, while its content being allocated on LSA. If an
exception is thrown however, both are freed via the default allocator,
triggering an assert in LSA code. Move the cleanup into a `defer()` so
the correct cleanup sequence is executed even on exceptions.
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:
sharded<service> s;
co_await s.invoke_on_all([] {
...
barrier.abort();
});
...
co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subseqent stop
will queue its stopping lambdas after barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11553
The test is supposed to give a helpful error message when the user forgets to
run --populate before the benchmark. But this must have become broken at some
point, because execute_cql() terminates the program with an unhelpful
("unconfigured table config") message, which doesn't mention --populate.
Fix that by catching the exception and adding the helpful tip.
Closes#11533
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes#11549Closes#11555