Currently, region_group forms a hierarchy. Originally it was a tree,
but previous work whittled it down to a parent-child relationship
(with a single, optional parent, and a single child).
The actual behavior of the parent and child are very different, so
it makes sense to split them. The main difference is that the parent
does not contain any regions (memtables), but the child does.
This patch mechanically splits the class. The parent is named
memory_hard_limit (reflecting its role to prevent lsa allocation
above the memtable configured hard limit). The child is still named
region_group.
Details of the transformation:
- each function or data member in region_group is either moved to
memory_hard_limit, duplicated in memory_hard_limit, or left in
region_group.
- the _regions and _blocked_requests members, which were always
empty in the parent, were not duplicated. Any member that only accessed
them was similarly left alone.
- the "no_reclaimer" static member which was only used in the parent
was moved there. Similarly the constructor which accepted it
was moved.
- _child was moved to the parent, and _parent was kept in the child
(more or less the defining change of the split). Similarly,
add(region_group*) and del(region_group*) (which manage _child) were moved.
- do_for_each_parent(), which iterated to the top of the tree, was removed
and its callers manually unroll the loop. For the parent, this is just
a single iteration (since we're iterating towards the root), for the child,
this can be two iterations, but the second one is usually simpler since
the parent has many members removed.
- do_update(), introduced in the previous patch, was made a template that
can act on either the parent or the child. It will be further simplified
later.
- some tests that check now-impossible topologies were removed.
- the parent's shutdown() is trivial since it has no _blocked_requests,
but it was kept to reduce churn in the callers.
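A minimal sketch of the resulting shape (simplified names and members, not the actual ScyllaDB declarations; the parent keeps only a single child pointer, the child keeps the parent pointer plus the regions and blocked requests):
#include <cstddef>

class region_group;  // the child: still holds regions (memtables) and blocked requests

class memory_hard_limit {
    std::size_t _total_memory = 0;
    region_group* _child = nullptr;   // single child; add()/del() moved here
public:
    void add(region_group* child) { _child = child; }
    void del(region_group*) { _child = nullptr; }
    std::size_t memory_used() const { return _total_memory; }
};

class region_group {
    memory_hard_limit* _parent = nullptr;  // kept in the child; the defining change of the split
    std::size_t _total_memory = 0;
    // _regions and _blocked_requests live only here; they were always empty in the parent
public:
    explicit region_group(memory_hard_limit* parent = nullptr) : _parent(parent) {
        if (_parent) { _parent->add(this); }
    }
    ~region_group() { if (_parent) { _parent->del(this); } }
};

int main() {
    memory_hard_limit parent;
    region_group child(&parent);  // one parent/child pair, no deeper hierarchy
}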
A mechanical transformation intended to allow reuse later. The function
doesn't really deserve to exist on its own, so it will be swallowed back
by its callers later.
region_group currently fulfills two roles: in one role, when instantiated
as dirty_memory_manager::_virtual_region_group, it is responsible
for holding functions that allocate memtable memory (writes) and only
allowing them to run when enough dirty memory has been flushed from other
memtables. The other role, when instantiated as
dirty_memory_manager::_real_region_group, is to provide a hard stop when
the total amount of dirty memory exceeds the limit, since the other limit
is only estimated.
We want to simplify the whole thing, which means not using the same class
for two different roles (or rather, we can use it for both roles if we
simplify the internals significantly).
As a first step towards clarifying what functionality is used in what
role, move some classes related to holding allocating functions to a new
class allocation_queue. We will gradually move more content there, reducing
the amount of role confusion in region_group.
Type aliases are added to reduce churn.
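As a rough illustration (hypothetical, simplified interface; the real class carries more state), allocation_queue only owns the blocked allocating functions:
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

class allocation_queue {
    // writes waiting for enough dirty memory to be flushed
    std::deque<std::function<void()>> _blocked_requests;
public:
    void enqueue(std::function<void()> allocating_fn) {
        _blocked_requests.push_back(std::move(allocating_fn));
    }
    // Called by the owner once it decides enough dirty memory was released.
    void release_one() {
        if (!_blocked_requests.empty()) {
            auto fn = std::move(_blocked_requests.front());
            _blocked_requests.pop_front();
            fn();
        }
    }
    std::size_t size() const { return _blocked_requests.size(); }
};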
We only have one parent/child relationship in the region group
hierarchy, so support for more is unneeded complexity. Replace
the subgroup vector with a single pointer, and delete a test
for the removed functionality.
The token_metadata::calculate_pending_ranges_for_bootstrap() makes a
clone of itself and adds bootstrapping nodes to the clone to calculate
ranges. Currently the added nodes lack the dc/rack, which adversely
affects the calculations.
Unfortunately, the dc/rack for those nodes is not available on topology
(yet) and needs pretty heavy patching to have. Fortunately, the only
caller of this method has the gossiper at hand to provide the dc/rack.
fixes: #11531
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11596
applier_fiber could create multiple snapshots between
io_fiber runs. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.
In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.
_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.
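A simplified sketch of the mechanism (illustrative types only, not the actual raft code):
#include <cstdint>
#include <optional>
#include <vector>

using snapshot_id = uint64_t;

struct fsm_output_sketch {
    std::optional<snapshot_id> snp;         // latest snapshot to store
    std::vector<snapshot_id> snps_to_drop;  // superseded snapshots to delete
};

// applier_fiber side: remember the snapshot being replaced before overwriting it.
void apply_snapshot(fsm_output_sketch& out, snapshot_id new_snp) {
    if (out.snp) {
        out.snps_to_drop.push_back(*out.snp);
    }
    out.snp = new_snp;
    // ... then signal _sm_events so io_fiber wakes up and consumes the output
}

// io_fiber side: store the last snapshot descriptor, then drop the superseded ones.
void io_fiber_step(fsm_output_sketch& out) {
    // ... persist *out.snp here ...
    out.snp.reset();
    for (snapshot_id id : out.snps_to_drop) {
        (void)id;  // drop snapshot `id` from storage
    }
    out.snps_to_drop.clear();
}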
The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.
Closes#11530
For some reason, the test is currently flaky on Jenkins. Apparently the
Python driver does not reconnect to the cluster after the cluster
restarts (well it does, but then it disconnects from one of the nodes
and never reconnects again). This causes the test to hang on "waiting
until driver reconnects to every server" until it times out.
Disable it for now so it doesn't block next promotion.
Fix https://github.com/scylladb/scylladb/issues/11373
- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.
Closes#11417
* github.com:scylladb/scylladb:
doc: add a comment to remove the note in version 5.1
doc: update the information on the Counting all rows page and add the recommendation to upgrade ScyllaDB
doc: add a note to the description of COUNT with a reference to the KB article
doc: add COUNT to the list of acceptable selectors of the SELECT statement
Tools want to disrupt the environment they run in as little as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory; in fact they don't want to reserve any amount up front, instead consuming as much as needed on demand.
To achieve this, tools want to use the standard allocator. For that they need a seastar option to instruct seastar to *not* configure and use the seastar allocator, and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.
Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.
Closes#11510
* github.com:scylladb/scylladb:
test/boost: add alternative variant of logalloc test
tools: use standard allocator
utils/logalloc: add use_standard_allocator_segment_pool_backend()
utils/logalloc: introduce segment store backend for standard allocator
utils/logalloc: rebase release segment-store on segment-store-backend
utils/logalloc: introduce segment_store_backend
utils/logalloc: push segment alloc/dealloc to segment_store
test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Fix https://github.com/scylladb/scylladb/issues/11376
This PR adds the upgrade guide from version 5.0 to 5.1. It involves adding new files (5.0-to-5.1) and language/formatting improvements to the existing content (shared by several upgrade guides).
Closes#11577
* github.com:scylladb/scylladb:
doc: update the command to upgrade the ScyllaDB image from 5.0 to 5.1
doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1
This PR adds the missing upgrade guides for upgrading the ScyllaDB image to a patch release:
- ScyllaDB 5.0: /upgrade/upgrade-opensource/upgrade-guide-from-5.x.y-to-5.x.z/upgrade-guide-from-5.x.y-to-5.x.z-image/
- ScyllaDB Enterprise: /upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image/ (the file name is wrong and will be fixed with another PR)
In addition, the section regarding the recommended upgrade procedure has been improved.
Fixes https://github.com/scylladb/scylladb/issues/11450
Fixes https://github.com/scylladb/scylladb/issues/11452
Closes#11460
* github.com:scylladb/scylladb:
doc: update the commands to upgrade the ScyllaDB image
doc: fix the filename in the index to resolve the warnings and fix the link
doc: apply feedback by adding the step to load the new repo and fixing the links
doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
doc: update the Enterprise guide to include the Enterprise-only image file
doc: update the image files
doc: split the upgrade-image file to separate files for Open Source and Enterprise
doc: clarify the alternative upgrade procedures for the ScyllaDB image
doc: add the upgrade guide for ScyllaDB Image from 2022.x.y to 2022.x.z
doc: add the upgrade guide for ScyllaDB Image from 5.x.y to 5.x.z
It points to a private scylladb repo, which has no place in user-facing
documentation. For now there is no public replacement, but a similar
functionality is in the works for Scylla Manager.
Fixes: #11573
Closes#11580
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
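The shape of the check being moved to table creation time, as a sketch (the function name is illustrative, not the actual ScyllaDB one; the minimum value is the one quoted above):
#include <stdexcept>
#include <string>

constexpr double minimum_supported_fp_chance = 6.71e-5;

void validate_bloom_filter_fp_chance(double fp_chance) {
    // Refuse the schema change instead of failing later during memtable flush.
    if (fp_chance < minimum_supported_fp_chance) {
        throw std::invalid_argument(
            "bloom_filter_fp_chance " + std::to_string(fp_chance) +
            " is below the minimum supported value " +
            std::to_string(minimum_supported_fp_chance));
    }
}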
Fixes#11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11576
Changes to avoid pitfalls and fix issues in sstable-related unit tests
Closes#11578
* github.com:scylladb/scylladb:
test: Make fake sstables implicitly belong to current shard
test: Make it clearer that sstables::test::set_values() modifies data size
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.
It waits until the driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.
The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
Closes#11563
* github.com:scylladb/scylladb:
test/topology_raft_disabled: write basic raft upgrade test
test: setup logging in topology suites
Fake SSTables will be implicitly owned by the shard that created them,
allowing them to be used in procedures that assert the SSTables
are owned by the current shard, like the table's procedure that rebuilds
the sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By adding a param with default value, we make it clear in the interface
that the procedure modifies sstable data size.
It can happen that one calls this function without noticing that it overrides
the data size previously set using a different function.
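One possible reading of the change, as an illustrative sketch (the names and the default value are made up, not the actual test helper):
#include <cstdint>

struct fake_sstable_stats {
    uint64_t data_size = 0;
};

// The defaulted parameter keeps old call sites working, while making it visible
// at new call sites that this helper overrides any previously set data size.
void set_values(fake_sstable_stats& stats, uint64_t data_size = 1 << 20) {
    stats.data_size = data_size;
}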
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Make it possible to use logging from within tests in the topology
suites. The tests are executed using `pytest`, which uses a `pytest.ini`
file for logging configuration.
Also clean up the `pytest.ini` files a bit.
In compatibility.md where we refer to the missing ability to add a GSI
to an existing table - let's refer to a new issue specifically about this
feature, instead of the old bigger issue about UpdateItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11568
when using ":attrs" attribute' from Nadav Har'El
This PR improves the testing for issue #5009 and fixes most of it (but
not all - see below). Issue #5009 is about what happens when a user
tries to use the name `:attrs` for an attribute - while Alternator uses
a map column with that name to hold all the schema-less attributes of an
item. The tests we had for this issue were partial, and missed the
worst cases which could result in Scylla crashing on specially-crafted
PutItem or UpdateItem requests.
What the tests missed were the cases that `:attrs` is used as a
**non-key**. So in this PR we add additional tests for this case,
several of which fail or even crash Scylla, and then we fix all these
cases.
Issue #5009 remains open because using `:attrs` as the name of a **key**
is still not allowed. But because it results in a clean error message
when attempting to create a table with such a key, I consider this
remaining problem very minor.
Refs #5009.
Closes#11572
* github.com:scylladb/scylladb:
alternator: fix crashes and errors when using ":attrs" attribute
alternator: improve tests for reserved attribute name ":attrs"
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.
The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".
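The fixed check, roughly (simplified sketch; the real code works on schema objects rather than a set of names):
#include <set>
#include <string>

const std::string attrs_column_name = ":attrs";

// An attribute is stored in its own column only if a column with that name exists
// and it is not the special ":attrs" map holding all the other attributes.
bool stored_in_own_column(const std::set<std::string>& schema_columns,
                          const std::string& attribute_name) {
    return schema_columns.count(attribute_name) > 0 && attribute_name != attrs_column_name;
}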
This fix makes the relevant tests, which used to crash or fail, now pass.
This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.
Refs #5009
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As explained in issue #5009, Alternator currently forbids the special
attribute name ":attrs", whereas DynamoDB allows any string of approriate
length (including the specific string ":attrs") to be used.
We had only a partial test for this incompatibility, and this patch
improves the testing of this issue. In particular, we were missing a
test for the case that the name ":attrs" was used for a non-key
attribute (we only tested the case it was used as a sort key).
It turns out that Alternator crashes on the new test, when the test tries
to write to a non-key attribute called ":attrs", so we needed to mark
the new test with "skip". Moreover, it turns out that different code paths
handle the attribute name ":attrs" differently, and also crash or fail
in other ways - so we added more than one xfailing or skipped test,
each failing in a different place (and also a few tests that do pass).
As usual, we checked that the new tests pass on DynamoDB.
Refs #5009
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This tiny series fixes some small errors and out-of-date information in Alternator documentation and code comments.
Closes#11547
* github.com:scylladb/scylladb:
alternator ttl: comment fixes
docs/alternator: fix mention of old alternator-test directory
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.
This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.
Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
the borrower is then responsible for cleaning up the object.
Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.
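The real Pool is Python test-framework code; the C++ sketch below only illustrates the described get/put/steal contract (objects built lazily inside get, put and steal both waking waiters):
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

template <typename T>
class pool_sketch {
    std::mutex _mu;
    std::condition_variable _cv;
    std::deque<T> _ready;               // returned, still-clean objects
    std::size_t _in_use = 0;
    std::size_t _max;
    std::function<T()> _build;
public:
    pool_sketch(std::size_t max, std::function<T()> build)
        : _max(max), _build(std::move(build)) {}
    T get() {                           // borrow, building only when nothing is available
        std::unique_lock lk(_mu);
        _cv.wait(lk, [&] { return !_ready.empty() || _in_use < _max; });
        ++_in_use;
        if (!_ready.empty()) {
            T obj = std::move(_ready.front());
            _ready.pop_front();
            return obj;
        }
        return _build();                // objects are only ever built here
    }
    void put(T obj) {                   // return a borrowed, reusable object
        { std::lock_guard lk(_mu); _ready.push_back(std::move(obj)); --_in_use; }
        _cv.notify_one();
    }
    void steal() {                      // free up the slot; the borrower cleans up the object
        { std::lock_guard lk(_mu); --_in_use; }
        _cv.notify_one();
    }
};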
Closes#11558
The new test initializes LSA with use_standard_allocator_segment_pool_backend(),
running the logalloc_test suite on the standard allocator segment pool
backend. To avoid duplicating the test code, the new test-file pulls in
the test code via #include. I'm not proud of it, but it works and we
test LSA with both the debug and standard memory segment stores without
duplicating code.
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.
This generates a chicken-and-egg problem -- to create a socket the messaging
service needs topology information, but in order to get it the gossiper
needs to create a socket.
Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not available either.
The situation is extra-complicated by the fact that sockets are not
created for individual verbs. Instead, verbs are grouped into several
"indices" and a socket is created per index. Thus, the "gossiping" index that
includes non-gossiper verbs will create a topology-less socket for all
verbs in it. Worse -- when raft sends messages without topology information
available, the corresponding socket is created with the assumption that the
peer lives in the default dc and rack, which doesn't match the local node's
dc/rack, and the whole index group gets the "randomly" configured socket.
Also, the tcp-nodelay code tries to implement a similar check, but uses the
wrong index of 1, so it's also fixed here.
"
* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
messaging_service: Fix gossiper verb group
messaging_service: Mind the absence of topology data when creating sockets
messaging_service: Templatize and rename remove_rpc_client_one
Use the new seastar option to instruct seastar to not initialize and use
the seastar allocator, relying on the standard allocator instead.
Configure LSA with the standard allocator based segment store backend:
* scylla-types reserves 1MB for LSA -- in theory nothing here should use
LSA, but just in case...
* scylla-sstable reserves 100MB for LSA, to avoid excessive thrashing in
the sstable index caches.
With this, tools now should allocate memory on demand, without reserving
a large chunk of (or all of) the available memory, as regular seastar
apps do.
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For the latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
There's a bunch of helpers for CDC gen service in db/system_keyspace.cc. All are static and use the global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods via them, thus reducing the global qctx usage.
Closes#11557
* github.com:scylladb/scylladb:
system_keyspace: De-static get_cdc_generation_id()
system_keyspace: De-static cdc_is_rewritten()
system_keyspace: De-static cdc_set_rewritten()
system_keyspace: De-static update_cdc_generation_id()
- Raise on response not HTTP 200 for `.get_text()` helper
- Fix API paths
- Close and start a fresh driver when restarting a server and it's the only server in the cluster
- Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions)
Closes#11496
* github.com:scylladb/scylladb:
test.py: handle duplicate result from driver
test.py: log server restarts for topology tests
test.py: log actions for topology tests
Revert "test.py: restart stopped servers before...
test.py: ManagerClient API fix return text
test.py: ManagerClient raise on HTTP != 200
test.py: ManagerClient fix paths to updated resource
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend, which now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
one -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for the debug build;
we want a standard allocator based store more similar to the seastar memory
allocator based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
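A rough sketch of the concept (illustrative interface, not the actual ScyllaDB declarations): the store keeps the segment <-> index mapping, while a backend only decides how the backing memory is obtained, so a standard-allocator backend can be dropped in later:
#include <cstddef>
#include <cstdlib>

class segment_store_backend {
public:
    virtual ~segment_store_backend() = default;
    virtual void* allocate_segment(std::size_t segment_size) = 0;
    virtual void free_segment(void* seg, std::size_t segment_size) = 0;
};

// Backend suitable for tools using the standard allocator: nothing is reserved
// up front, each segment is allocated on demand.
class standard_allocator_backend final : public segment_store_backend {
public:
    void* allocate_segment(std::size_t segment_size) override {
        // segment-aligned, segment-sized allocation from the standard allocator
        return std::aligned_alloc(segment_size, segment_size);
    }
    void free_segment(void* seg, std::size_t) override {
        std::free(seg);
    }
};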
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic into the segment store too. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
Said test creates two vectors, the vector storage being allocated with
the default allocator, while their contents are allocated on LSA. If an
exception is thrown, however, both are freed via the default allocator,
triggering an assert in LSA code. Move the cleanup into a `defer()` so
the correct cleanup sequence is executed even on exceptions.
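The pattern of the fix, sketched with a tiny stand-in scope guard to keep the example self-contained (the test itself uses a deferred cleanup; the names below are illustrative):
#include <utility>

template <typename Func>
struct scope_guard {
    Func f;
    explicit scope_guard(Func fn) : f(std::move(fn)) {}
    ~scope_guard() { f(); }   // runs on every exit path, including exceptions
};

void test_body_sketch() {
    // ... create the vectors; their contents are allocated on LSA ...
    scope_guard cleanup{[&] {
        // destroy the LSA-allocated contents here, in the proper allocation
        // context, whether or not one of the steps below throws
    }};
    // ... test steps that may throw (allocations, region compaction, ...) ...
}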
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited on by the owning sharded service's .stop(),
because barrier usage is like this:
sharded<service> s;
co_await s.invoke_on_all([] {
...
barrier.abort();
});
...
co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subsequent stop
will queue its stopping lambdas after the barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet, causing a use-after-free.
Fortunately, this can be easily fixed by capturing a shared
pointer to the shared barrier instead of a regular pointer to the
shard-local barrier.
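The shape of the fix (simplified; the real code uses seastar's cross-shard facilities):
#include <memory>

struct barrier_state {
    // waiters, abort flag, ... shared by all shards
};

struct cross_shard_barrier_sketch {
    std::shared_ptr<barrier_state> _shared;
    void abort() {
        // Capture the shared_ptr, not `this` or a raw pointer to the shard-local
        // wrapper: the captured copy keeps the state alive even if the owning
        // service is stopped and destroyed before the background fiber runs.
        auto wake_other_shards = [state = _shared] {
            if (state) {
                // ... deliver the abort exception to sleeping waiters via *state ...
            }
        };
        wake_other_shards();  // in the real code this runs as a background fiber
    }
};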
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11553
The test is supposed to give a helpful error message when the user forgets to
run --populate before the benchmark. But this must have become broken at some
point, because execute_cql() terminates the program with an unhelpful
("unconfigured table config") message, which doesn't mention --populate.
Fix that by catching the exception and adding the helpful tip.
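Roughly the shape of the fix (illustrative only; the actual function and message text differ):
#include <functional>
#include <stdexcept>
#include <string>

void run_first_query(const std::function<void()>& execute_cql) {
    try {
        execute_cql();
    } catch (const std::exception& e) {
        // Re-throw with the hint the original error message lacked.
        throw std::runtime_error(std::string(e.what()) +
            " (did you forget to run the benchmark with --populate first?)");
    }
}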
Closes#11533
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes#11549
Closes#11555