Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.
Also, unmask is unnecessary for enabling the timer.
Fixes#14249Closes#14252
(cherry picked from commit c70a9cbffe)
Closes#14422
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes#14676Closes#14677
(cherry picked from commit 1b577e0414)
Fixes https://github.com/scylladb/scylladb/issues/14598
This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)
This commit must be backported to branch-5.3, because
the feature is already on that branch.
Closes#14686
(cherry picked from commit 9db9dedb41)
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.
Refs #7926Fixes#13689Closes#12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
(cherry picked from commit d2e089777b)
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
Fixes: #13841Fixes: #12552Closes#13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
(cherry picked from commit a7c2c9f92b)
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probablity, which blocks other work.
Fixes#14395Closes#14662
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.
\Fixes scylladb/scylladb#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
\Closes scylladb/scylladb#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
(cherry picked from commit 87b4606cd6)
Closes#14647
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Backported from c25201c1a3. view_build_test.cc needed some tiny adjustments for the backport.
Closes#14622Fixes#14503
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes#14503
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14294
(cherry picked from commit f18e967939)
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes#14452
(cherry picked from commit ee9bfb583c)
Closes#14606
On main.cc, we have early commands which want to run prior to initialize
Seastar.
Currently, perf_fast_forward is breaking this, since it defined
"app_template app" on global variable.
To avoid that, we should defer running app_template's constructor in
scylla_fast_forward_main().
Fixes#13945Closes#14026
(cherry picked from commit 45ef09218e)
There was a bug that caused aggregates to fail when
used on column-sensitive columns.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes is a bug and must be backported to branch-5.3, branch-5.2., and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes#14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
(cherry picked from commit 8a7261fd70)
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
Fixes#14284Closes#14300
(cherry picked from commit 732feca115)
Backport note: the original `assert` was never added to branch-5.3, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.
This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.
Fixes#13543
(cherry picked from commit a45e0765e4)
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.
This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046Closes#14373
(cherry picked from commit f4ae2c095b)
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes#13491Closes#14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
(cherry picked from commit 586102b42e)
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036
This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.
This commit fixex a bug and must be backported to
branch-5.3 and branch-5.2.
Closes#14372
(cherry picked from commit 74fc69c825)
Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.
(cherry picked from commit f60f89df17)
Closes#14371
Fixes https://github.com/scylladb/scylladb/issues/14084
This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.
Closes#14228
* github.com:scylladb/scylladb:
doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
doc: add OS support for ScyllaDB 5.3
(cherry picked from commit aaac455ebe)
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)
Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.
Fix is about not inserting shared sstables into cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor it should because
resharding itself cleaned up its input files).
Dtest: https://github.com/scylladb/scylla-dtest/pull/3206Fixes#14001.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14147
(cherry picked from commit 156d771101)
Fixes https://github.com/scylladb/scylladb/issues/14097
This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.
The update is in sync with the change made for
ScyllaDB 5.2.
This commit must be backported to branch-5.2 and
branch-5.3.
Closes#14118
(cherry picked from commit b7022cd74e)
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784Closes#13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.
Fixes#14139Closes#14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes#13545.
Closes#14134
(cherry picked from commit f51312e580)
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes#14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14038
(cherry picked from commit 23443e0574)
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.
Fixes#9559Closes#13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13514
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13931
Without the feature, the system schema doesn't have the table, and the
read will fail with:
Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)
We should not attempt to read tablet metadata in the experimental
feature is not enabled.
Fixes#13946Closes#13947
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.
Once g.c. code moved, it allows to clean the class sstable list of static helpers a bit.
refs: #13024
refs: #13020
refs: #12707Closes#13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbace collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.
`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.
`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.
The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.
It's cumbersome and people forget about it,
so let's change the default to `user`.
There issues about expressions being printed
in a `strange` way, this fixes them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13916
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes#13933
This implicit link it pretty bad, because feature service is a low-level
one which lots of other services depend on. System keyspace is opposite
-- a high-level one that needs e.g. query processor and database to
operate. This inverse dependency is created by the feature service need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place when enabled features are
committed is when gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Despite gossiper already has system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837Closes#13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
Some state that is used to fill in 'peeers' table is still propagated
over gossiper. When moving a node into the normal state in raft
topology code use the data from the gossiper to populate peers table because
storage_service::on_change() will not do it in case the node was not in
normal state at the time it was called.
Fixes: #13911
Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870Closes#13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13908
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13853