Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.
Also, unmask is unnecessary for enabling the timer.
Fixes#14249Closes#14252
(cherry picked from commit c70a9cbffe)
Closes#14422
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes#14676Closes#14677
(cherry picked from commit 1b577e0414)
Fixes https://github.com/scylladb/scylladb/issues/14598
This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)
This commit must be backported to branch-5.3, because
the feature is already on that branch.
Closes#14686
(cherry picked from commit 9db9dedb41)
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.
Refs #7926Fixes#13689Closes#12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
(cherry picked from commit d2e089777b)
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
Fixes: #13841Fixes: #12552Closes#13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
(cherry picked from commit a7c2c9f92b)
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probablity, which blocks other work.
Fixes#14395Closes#14662
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.
\Fixes scylladb/scylladb#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
\Closes scylladb/scylladb#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
(cherry picked from commit 87b4606cd6)
Closes#14647
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Backported from c25201c1a3. view_build_test.cc needed some tiny adjustments for the backport.
Closes#14622Fixes#14503
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes#14503
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14294
(cherry picked from commit f18e967939)
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes#14452
(cherry picked from commit ee9bfb583c)
Closes#14606
On main.cc, we have early commands which want to run prior to initialize
Seastar.
Currently, perf_fast_forward is breaking this, since it defined
"app_template app" on global variable.
To avoid that, we should defer running app_template's constructor in
scylla_fast_forward_main().
Fixes#13945Closes#14026
(cherry picked from commit 45ef09218e)
There was a bug that caused aggregates to fail when
used on column-sensitive columns.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes is a bug and must be backported to branch-5.3, branch-5.2., and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes#14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
(cherry picked from commit 8a7261fd70)
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
Fixes#14284Closes#14300
(cherry picked from commit 732feca115)
Backport note: the original `assert` was never added to branch-5.3, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.
This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.
Fixes#13543
(cherry picked from commit a45e0765e4)
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.
This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046Closes#14373
(cherry picked from commit f4ae2c095b)
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes#13491Closes#14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
(cherry picked from commit 586102b42e)
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036
This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.
This commit fixex a bug and must be backported to
branch-5.3 and branch-5.2.
Closes#14372
(cherry picked from commit 74fc69c825)
Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.
(cherry picked from commit f60f89df17)
Closes#14371
Fixes https://github.com/scylladb/scylladb/issues/14084
This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.
Closes#14228
* github.com:scylladb/scylladb:
doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
doc: add OS support for ScyllaDB 5.3
(cherry picked from commit aaac455ebe)
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)
Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.
Fix is about not inserting shared sstables into cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor it should because
resharding itself cleaned up its input files).
Dtest: https://github.com/scylladb/scylla-dtest/pull/3206Fixes#14001.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14147
(cherry picked from commit 156d771101)
Fixes https://github.com/scylladb/scylladb/issues/14097
This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.
The update is in sync with the change made for
ScyllaDB 5.2.
This commit must be backported to branch-5.2 and
branch-5.3.
Closes#14118
(cherry picked from commit b7022cd74e)
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784Closes#13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.
Fixes#14139Closes#14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes#13545.
Closes#14134
(cherry picked from commit f51312e580)
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes#14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14038
(cherry picked from commit 23443e0574)
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.
Fixes#9559Closes#13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13514
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13931
Without the feature, the system schema doesn't have the table, and the
read will fail with:
Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)
We should not attempt to read tablet metadata in the experimental
feature is not enabled.
Fixes#13946Closes#13947
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.
Once g.c. code moved, it allows to clean the class sstable list of static helpers a bit.
refs: #13024
refs: #13020
refs: #12707Closes#13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbace collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.
`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.
`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.
The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.
It's cumbersome and people forget about it,
so let's change the default to `user`.
There issues about expressions being printed
in a `strange` way, this fixes them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13916
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes#13933
This implicit link it pretty bad, because feature service is a low-level
one which lots of other services depend on. System keyspace is opposite
-- a high-level one that needs e.g. query processor and database to
operate. This inverse dependency is created by the feature service need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place when enabled features are
committed is when gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Despite gossiper already has system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837Closes#13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
Some state that is used to fill in 'peeers' table is still propagated
over gossiper. When moving a node into the normal state in raft
topology code use the data from the gossiper to populate peers table because
storage_service::on_change() will not do it in case the node was not in
normal state at the time it was called.
Fixes: #13911
Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870Closes#13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13908
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13853
There are several places that need to carry a pointer to a table that's shard-wide accessible -- database snapshot and truncate code and distributed loader. The database code uses `get_table_on_all_shards()` returning a vector of foreign lw-pointers, the loader code uses its own global_column_family_ptr class.
This PR generalizes both into global_table_ptr facility.
Closes#13909
* github.com:scylladb/scylladb:
replica: Use global_table_ptr in distributed loader
replica: Make global_table_ptr a class
replica: Add type alias for vector of foreign lw-pointers
replica: Put get_table_on_all_shards() to header
replica: Rewrite get_table_on_all_shards()
instead of encoding the fact that we are using generation identifier
as a hint where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. this helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13846
`ep` is std::move'ed to get_endpoint_state_for_endpoint_ptr
but it's used later for logger.warn()
Fixes#13921
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13920
The loader has very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now all users of global_table know it's a vector and reference its
elements with this_shard_id() index. Making the global_table_ptr a class
makes it possible to stop using operator[] and "index" this_shard_id()
in its -> and * operators.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use sharded<database>::invoke_on_all() instead of open-coded analogy.
Also don't access database's _column_families directly, use the
find_column_family() method instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tempdir for filesystem-based sstables is {generation}.sstable one.
There are two places that need to know the ".sstable" extention -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself with
the path being pending-delete dir for sure. So the method can be made
private and the is_pending_delete_dir() can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When non static the call can use sstable_directory::_sstable_dir path,
not the provided argument. The main benefit is that the method can later
be moved onto lister so that filesystem and ownership-table listers can
process dangling bits differently.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check if the deletion log exists in the handling helper, not outside of
it. This makes next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.
First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert ssting dir into fs::path dir and switch logger.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the loader starts it first scans the directory for sstables'
tempdirs and pending deletion logs. Put both into one call so that it
can be moved more easily later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes#13917
perform_cleanup may be waiting for those sstables
to become eligible for cleanup so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is possible that cleanup will be executed
right after repair-based node operations,
in which case we have a 5 minutes timer
before off-strategy compaction is started.
After marking the sstables that need cleanup,
perform offstrategy compaction, if needed.
This will implicitly cleanup those sstables
as part of offstrategy compaction, before
they are even passed for view update (if the table
has views/secondary index).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scan all sstables to determine which of them
requires cleanup before calling perform_task_on_all_files.
This allows for cheaper no-op return when
no sstable was identified as requiring cleanup,
and also it will allow triggering offstrategy
compaction if needed, after selecting the sstables
for cleanup, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently callers of `for_each_sstable` need to
use a seastar thread to allow preemption
in the for_each_sstable loop.
Provide for_each_sstable_gently and
for_each_sstable_gently_until to make using this
facility from a coroutine easier, without requiring
a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
string_format_test was added in 1b5d5205c8,
so let's add it to CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13912
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13907
Currently we may start to receive requests before group0 is configured
during boot. If that happens those requests may try to pull schema and
issue raft read barrier which will crash the system because group0 is
not yet available. Workaround it by pretending the raft is disabled in
this case and use non raft procedure. The proper fix should make sure
that storage proxy verbs are registered only after group0 is fully
functional.
Message-Id: <ZGOZkXC/MsiWtNGu@scylladb.com>
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes#12462Closes#13906
CI once failed due to mc being unable to configure minio server. There's currently no glues why it could happen, let's increase the minio.py verbosity a bit
refs: #13896Closes#13901
* github.com:scylladb/scylladb:
test,minio: Run mc with --debug option
test,minio: Log mc operations to log file
Currently, when a user creates a function or a keyspace, no
permissions on functions are update.
Instead, the user should gain all permissions on the function
that they created, or on all functions in the keyspace they have
created. This is also the behavior in Cassandra.
However, if the user is granted permissions on an function after
performing a CREATE OR REPLACE statement, they may
actually only alter the function but still gain permissions to it
as a result of the approach above, which requires another
workaround added to this series.
Lastly, as of right now, when a user is altering a function, they
need both CREATE and ALTER permissions, which is incompatible
with Cassandra - instead, only the ALTER permission should be
required.
This series fixes the mentioned issues, and the tests are already
present in the auth_roles_test dtest.
Fixes#13747Closes#13814
* github.com:scylladb/scylladb:
cql: adjust tests to the updated permissions on functions
cql: fix authorization when altering a function
cql: grant permissions on functions when creating a keyspace/function
cql: pass a reference to query processor in grant_permissions_to_creator
test_permissions: make tests pass on cassandra
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.
get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating direction over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators.
Fixes#13899Closes#13900
* github.com:scylladb/scylladb:
storage_service: do not preempt while traversing endpoint_state_map
gossiper: do not preempt while traversing endpoint_state_map
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apprently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since [gms: heart_beat_state: use generation_type and version_type](4cdad8bc8b)
(merged in [Merge 'gms: define and use generation and version types'...](7f04d8231d))
Implementing min/max correctly
Fixes#13801Closes#13880
* github.com:scylladb/scylladb:
storage_service: handle_state_normal: on_internal_error on "owns no tokens"
utils: tagged_integer: implement std::numeric_limits::{min,max}
test: add tagged_integer_test
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.
We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.
Fixes#13807Closes#13804
The auto& db = proxy.local().get_db() is called few lines above this
patch, so the &db can be reused for invoke_on_all() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13896
Currently everything minio.py does goes to test.py log, while mc (and
minio) output go to another log file. That's inconvenient, better to
keep minio.py's messages in minio log file.
Also, while at it, print a message if local alias drop fails (it's
benign failure, but it's good to have the note anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this option allows user to use specified linker instead of the
default one. this is more flexible than adding more linker
candidates to the known linkers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13874
For unknown reasons, clang 16 rejects equality comparison
(operator==) where the left-hand-side is an std::string and the
right-hand-side is an sstring. gcc and older clang versions first
convert the left-hand-side to an sstring and then call the symmetric
equality operator.
I was able to hack sstring to support this assymetric comparison,
but the solution is quite convoluted, and it may be that it's clang
at fault here. So instead this patch eliminates the three cases where
it happened. With is applied, we can build with clang 16.
Closes#13893
in this change, the type of the "generation" field of "sstable" in the
return value of RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".
this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.
this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData. and "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by the nodetool, which prints out returned SSTable info list with
a pretty formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
the nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, neither does it enforce the type -- it just
tries it best to pretty print them as a tabular.
But the fields in CompositeData is typed, when the scylla JMX exporter
translates the returned SSTables from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So, we should be consistent on the type of "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
the dependencies on exactly same version when shipping deb and rpm
packages, we should be safe when it comes to interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13834
Instead of simply throwing an exception. With just the exception, it is
impossible to find out what went wrong, as this API is very generic and
is used in a variety of places. The backtrace printed by
`on_internal_error()` will help zero in on the problem.
Fixes: #13876Closes#13883
Although this condition should not happen,
we suspect that certain timing conditions might
lead this state of node in handle_normal_state
(possibly when shutdown) has no tokens.
Currently we call on_internal_error_noexcept, so
if abort_on_internal_error is false, we will just
print an error and continue on with handle_state_normal.
Change that to `on_internal_error` so to throw an
exception in production in this unexpected state.
Refs #13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13857
This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.
After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1
Closes: #13858
Separate cluster_size into a cluster section and specify this value as
initial_size.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13440
Add add a respective unit test.
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apprently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since 4cdad8bc8b
(merged in 7f04d8231d)
Implementing min/max correctly
Fixes#13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore. but many users would like to set the driver timers based on
server timers. we need to enable them to configure timeout even
when the server is still running.
in this change,
* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now a updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so `executor::_timeout_in_ms` can
be initialized on per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
the option to executor in its ctor.
Fixes#12232Closes#13300
* github.com:scylladb/scylladb:
alternator: split the param list of executor ctor into multi lines
alternator,config: make alternator_timeout_in_ms live-updateable
in this series, instead of hardwiring to integer, we switch to generation generator for creating new generations. this should helps us to migrate to a generation identifier which can also represented by UUID. and potentially can help to improve the testing coverage once we switch over to UUID-based generation identifier. will need to parameterize these tests by then, for sure.
Closes#13863
* github.com:scylladb/scylladb:
test: sstable: use generator to generate generations
test: sstable: pass generation_type in helper functions
test: sstable: use generator to generate generations
It is useful to know which node has the error. For example, when a node
has a corrupted sstable, with this patch, repair master node can tell
which node has the corrupted sstable.
```
WARN 2023-05-15 10:54:50,213 [shard 0] repair -
repair[2df49b2c-219d-411d-87c6-2eae7073ba61]: get_combined_row_hash: got
error from node=127.0.0.2, keyspace=ks2a, table=tb,
range=(8992118519279586742,9031388867920791714],
error=seastar::rpc::remote_verb_error (some error)
```
Fixes#13881Closes#13882
Currently, error injections can be enabled either through HTTP or CQL.
While these mechanisms are effective for injecting errors after a node
has already started, it can't be reliably used to trigger failures
shortly after node start. In order to support this use case, this commit
adds possibility to enable some error injections via config.
A configuration option `error_injections_at_startup` is added. This
option uses our existing configuration framework, so it is possible to
supply it either via CLI or in the YAML configuration file.
- When passed in commandline, the option is parsed as a
semicolon-separated list of error injection names that should be
enabled. Those error injections are enabled in non-oneshot mode.
The CLI option is marked as not used in release mode and does not
appear in the option list.
Example:
--error-injections-at-startup failure_point1;failure_point2
- When provided in YAML config, the option is parsed as a list of items.
Each item is either a string or a map or parameters. This method is
more flexible as it allows to provide parameters for each injection
point. At this time, the only benefit is that it allows enabling
points in oneshot mode, but more parameters can be added in the future
if needed.
Explanatory example:
error_injections_at_startup:
- failure_point1 # enabled in non-oneshot mode
- name: failure_point2 # enabled in oneshot mode
one_shot: true # due to one_shot optional parameter
The primary goal of this feature is to facilitate testing of raft-based
cluster features. An error injection will be used to enable an
additional feature to simulate node upgrade.
Tests: manual
Closes#13861
Constructors of trace_state class initialize most of the fields in constructor body with the help of non-inline helper method. It's possible and is better to initialize as much as possible with initializer lists.
Closes#13871
* github.com:scylladb/scylladb:
tracing: List-initialize trace_state::_records
tracing: List-initialize trace_state::_props
tracing: List-initialize trace_state::_slow_query_threshold
tracing: Reorder trace_state fields initialization
tracing: Remove init_session_records()
tracing: List-initialize one_session_records::ttl
tracing: List-initialize one_session_records
tracing: List-initialize session_record
There are two layers of stables deletion -- delete-atomically and wipe. The former is in fact the "API" method, it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner so that subsequent boot would pick the dangling parts and proceed. The latter is a low-level removal function that can fail in the middle, but it's not of _its_ care.
Currently the atomic deletion is implemented with the help of sstable_directory::delete_atomically() method that commits sstables files names into deletion log, then calls wipe (indirectly), then drops the deletion log. On boot all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though deletion log is an overkill for S3, it's better be implemented with the help of ownership table. In fact, S3 storage already implements atomic deletion in its wipe method thus being overly careful.
So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic
fixes: #13016
note: Replaying sstables deletion from ownership table on boot is not here, see #13024Closes#13562
* github.com:scylladb/scylladb:
sstables: Implement atomic deleter for s3 storage
sstables: Get atomic deleter from underlying storage
sstables: Move delete_atomically to manager and rename
Add basic test for tagged+integer arithmetic operations.
Remove const qualifier from `tagged_integer::operator[+-]=`
as these are add/sub-assign operators that need to modify
the value in place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similarly to how we handle Roles and Tables, we do not
allow permissions on non-existent objects, so the CREATE
permission on a specific function is meaningless, because
for the permission to be granted to someone, the function
must be already created.
This patch removes the CREATE permission from the set of
permissions applicable to a specific function.
Fixes#13822Closes#13824
This is a translation of Cassandra's CQL unit test source file
validation/entities/UFTypesTest.java into our cql-pytest framework.
There are 7 tests, which reproduce one known bug:
Refs #13746: UDF can only be used in SELECT, and abort when used in WHERE, or in INSERT/UPDATE/DELETE commands
And uncovered two previously unknown bugs:
Refs #13855: UDF with a non-frozen collection parameter cannot be called on a frozen value
Refs #13860: A non-frozen collection returned by a UDF cannot be used as a frozen one
Additionally, we encountered an issue that can be treated as either a bug or a hole in documentation:
Refs #13866: Argument and return types in UDFs can be frozen
Closes#13867
Adding new APIs /column_family/tombstone_gc and /storage_service/tombstone_gc, that will allow for disabling tombstone garbage collection (GC) in compaction.
Mimicks existing APIs /column_family/autocompaction and /storage_service/autocompaction.
column_family variant must specify a single table only, following existing convention.
whereas the storage_service one can specify an entire keyspace, or a subset of a tables in a keyspace.
column_family API usage
-----
```
The table name must be in keyspace:name format
Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
```
storage_service API usage
-----
```
Tables can be specified using a comma-separated list.
Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Enable GC on a subset of tables
curl -s -X POST
"http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"
```
Closes#13793
* github.com:scylladb/scylladb:
test: Test new API for disabling tombstone GC
test: rest_api: extract common testing code into generic functions
Add API to disable tombstone GC in compaction
api: storage_service: restore indentation
api: storage_service: extract code to set attribute for a set of tables
tests: Test new option for disabling tombstone GC in compaction
compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
table: Allow tombstone GC in compaction to be disabled on user request
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This series changes the code to issue raft read
barrier before the pull which will guaranty that the keyspace is created
before the actual schema pull is performed.
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.
The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.
That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth is timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.
Fixes#13660.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13851
This PR contains some small improvements to the safety of consuming/releasing resources to/from the semaphore:
* reader_permit: make the low-level `consume()/signal()` API private, making the only user (an RAII class) friend.
* reader_resources: split `reset()` into `noexcept` and potentially throwing variant.
* reader_resources::reset_to(): try harder to avoid calling `consume()` (when the new resource amount is smaller then the previous one)
Closes#13678
* github.com:scylladb/scylladb:
reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
reader_permit: split resource_units::reset()
reader_permit: make consume()/signal() API private
It consists of two parts -- call for do_read_simple() with lambda and handling of its results. PR coroutinizes it in two steps for review simplicity -- first the lambda, then the outer caller. Then restores indentation.
Closes#13862
* github.com:scylladb/scylladb:
sstables: Restore indentation after previous patches
sstables: Coroutinuze read_toc() outer part
sstables: Coroutinuze read_toc() inner part
Currently s3::client is created for each sstable::storage. It's later shared between sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle making a new standalone client. Coupled with the seastar's http client spawning connections on demand, this makes it impossible to control the amount of opened connections to object storage server.
In order to put some policy on top of that (as well as apply workload prioritization) s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint one each shard and then share it between all the sstables, files and sinks.
There's one difficulty however, solving which is most of what this PR does. The file handle, that's used to transfer sstable's file across shards, should keep aboard all it needs to re-create a file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owner by the database independently on each shard.
1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep sharded peering service referenced by the sstables_manager that's initialized in main and passed though the database constructor down to sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow
This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager gets the object_storage_config updating facilities from the database (it's overloaded even without it already). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections it also needs a patch seastar http::client, there's PR already doing that, once (if) merged there'll come one more fix on top.
refs: #13458
refs: #13369
refs: scylladb/seastar#1652Closes#13859
* github.com:scylladb/scylladb:
s3: Pick client from manager via handle
s3: Generalize s3 file handle
s3: Live-update clients' configs
sstables: Keep clients shared across sstables
storage_manager: Rewrap config map
sstables, database: Move object storage config maintenance onto storage_manager
sstables: Introduce sharded<storage_manager>
The existing storage::wipe() method of s3 is in fact atomic deleter --
it commits "deleting" status into ownership table, deletes the objects
from server, then removes the entry from ownership table. So the atomic
deleter does the same and the .wipe() just removes the objects, because
it's not supposed to be atomic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
While the driver isn't known without the sstable itself, we have a
vector of them can can get it from the front element. This is not very
generic, but fortunately all sstables here belong to the same table and,
respectively, to the same storage and even prefix. The latter is also
assert-checked by the sstable_directory atomic deleter code.
For now S3 storage returns the same directory-based deleter, but next
patch will change that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to let manager decide which storage driver to call for atomic
sstables deletion in the next patch. While at it -- rename the
sstable_directory's method into something more descriptive (to make
compiler catch all callers of it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This field needs to call trace_state::ttl_by_type() which, in turn,
looks into _props. The latter should have been initialized already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It takes props from constructor args and tunes them according to the
constructing "flavor" -- primary or secondary state. Adding two static
helpers code-document the intent and make list-initialization possible
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction strategies know how to pick files that are most likely to
satisfy tombstone purge conditions (i.e. not shadow data in uncompacting
files).
This logic can be bypassed if tombstone GC was disabled by user,
as it's a waste of effort to proceed with it until re-enabled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.
Not wired into an API yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The instance ptr and props have to be set up early, because other
members' initialization depends on them. It's currently OK, because
other members are initialized in the constructor body, but moving them
into initializer list would require correct ordering
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now does nothing but wraps make_lw_shared<one_session_records>()
call. Callers can do it on their own thus facilitating further
list-initialization patching
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For that to happen the value evaluation is moved from the
init_session_records() into a private trace_state helper as it checks
the props values initialized earlier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This object is constructed via one_session_records thus the latter needs
to pass some arguments along
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
Use the recently introduced replica side query utility functions to
query the content of the system tables. This allows us to cut the
dependency of the system keyspace on storage proxy.
The methods still take storage proxy parameter, this will be replaced
with replica::database in the next patch.
There is still one hidden storage proxy dependency left, via
clq3::query_processor. This will be addressed later.
Containing utility methods to query data from the local replica.
Intended to be used to read system tables, completely bypassing storage
proxy in the process.
This duplicates some code already found in storage proxy, but that is a
small price to pay, to be able to break some circular dependencies
involving storage proxy, that have been plaguing us since time
immemorial.
One thing we lose with this, is the smp service level using in storage
proxy. If this becomes a problem, we can create one in database and use
it in these methods too.
Another thing we lose is increasing `replica_cross_shard_ops` storage
proxy stat. I think this is not a problem at all as these new functions
are meant to be used by internal users, which will reduce the internal
noise in this metric, which is meant to indicate users not using
shard-aware clients.
As a result of the preceding patches, permissions on a function
are now granted to its creator. As a result, some permissions may
appear which we did not expect before.
In the test_udf_permissions_serialization, we create a function
as the superuser, and as a result, when we compare the permissions
we specifically granted to the ones read from the LIST PERMISSIONS
result, we get more than expected - this is fixed by granting
permissions explicitly to a new user and only checking this user's
permissions list.
In the test_grant_revoke_udf_permissions case, we test whether
the DROP permission in enforced on a function that we have previously
created as the same user - as a result we have the DROP permission
even without granting it directly. We fix this by testing the DROP
permission on a function created by a different user.
In the test_grant_revoke_alter_udf_permissions case, we previously
tested that we require both ALTER and CREATE permissions when executing
a CREATE OR REPLACE FUNCTION statement. The new permissions required
for this statement now depend on whether we actually CREATE or REPLACE
a function, so now we test that the ALTER permission is required when
REPLACING a function, and the CREATE permission is required when
CREATING a function. After the changes, the case no longer needs to
be arfitifially extracted from the previous one, so they are merged
now. Analogous adjustments are made in the test case
test_grant_revoke_alter_uda_permissions.
Currently, when a user is altering a function, they need
both CREATE and ALTER permissions, instead of just ALTER.
Additionally, after altering a function, the user is
treated as an owner of this function, gaining all access
permissions to it.
This patch fixes these 2 issues, by checking only the ALTER
permission when actually altering, and by not modifying
user's permisssions if the user did not actually create
the function.
When a user creates a function, they should have all permissions on
this function.
Similarly, when a user creates a keyspace, they should have all
permissions on functions in the keyspace.
This patch introduces GRANTs on the missing permissions.
In the following patch, the grant_permissions_to_creator method is going
to be also used to grant permissions on a newly created function. The
function resource may contain user-defined types which need the
query processor to be prepared, so we add a reference to it in advance
in this patch for easier review.
Despite the cql-pytests being intended to pass on both Scylla and
Cassandra, the test_permissions.py case was actually failing on
Cassandra in a few cases. The most common issue was a different
exception type returned by Scylla and Cassandra for an invalid
query. This was fixed by accepting 2 types of exceptions when
necessary.
The second issue was java UDF code that did not compile, which was
fixed simply by debugging the code.
The last issue was a case that was scylla_only with no good reason.
The missing java UDFs were added to that case, and the test was
adjusted so that the ALTER permission was only checked in a
CREATE OR REPLACE statement only if the UDF was already existing -
- Scylla requires it in both cases, which will get resolved in the
next patch.
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverity once we migrate to the
UUID-based generation identifier.
this change uses generator to generate generations for
`make_sstable_for_all_shards()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
always avoid using generation_type if possible. this helps us to
hide the underlying type of generation identifier, which could also
be a UUID in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverity once we migrate to the
UUID-based generation identifier.
this change uses generator to create generations for
`make_sstable_for_this_shard()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add the global-factory onto the client that is
- cross-shard copyable
- generates a client from local storage_manager by given endpoint
With that the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the s3 file handle tries to carry client's info via explicit
host name and endpoint config pointer. This is buggy, the latter pointer
is shard-local can cannot be transferred across shards.
This patch prepares the fix by abstracting the client handle part.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the client is accessible directli via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.
The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage are immutable for now.
Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is the part of keyspace configuration and
updating one in the object_storage.yaml will have no effect on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays each sstable gets its own instance of an s3::client. This patch
keeps clients on storage_manager's endpoints map and when creating a
storage for an sstable -- grab the shared pointer from the map, thus
making one client serve all sstables over there (except for those that
duplicated their files with the help of foreign-info, but that's to be
handled by next patches).
Moving the ownership of a client to the storage_manager level also means
that the client has to be closed on manager's stop, not on sstable
destroy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the map is endpoint -> config_ptr. Wrap the config_ptr into an
s3_endpoint struct. Next patch will keep the client on this new wrapper
struct thus making them shared between sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the map<endpoint, config> sits on the sstables manager and its
update is governed by database (because it's peering and can kick other
shards to update it as well).
Having the sharded<storage_manager> at hand lets freeing database from
the need to update configs and keeps sstables_manager a bit smaller.
Also this will allow keeping s3 clients shared between sstables via this
map by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager in question keeps track of whatever sstables_manager needs
to work with the storage (spoiler: only S3 one). It's main-local sharded
peering service, so that container() call can be used by next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It just needs to catch the system_error of ENOENT and re-throw it as
malformed_sstable_exception.
Indentatil is deliberately left broken. Again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One non-trivial change is the removal of buf temporary variable. That's
because it existed under the same name in the .then() lambda generating
name conflict after coroutinization.
Other than that it's pretty straightforward.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now which schema pull may issues raft read barrier it may stuck if
majority is not available. Make the operation abortable and abort it
during queries if timeout is reached.
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
Setting gc_before to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes#13572Closes#13800
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is current very inefficient as we need
to filter every partition and drop it as unowned.
It can be optimized by either special casing this case
or, better, use skip forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.
Fixes#13634
Also, add a respective rest_api unit test
Closes#13849
* github.com:scylladb/scylladb:
test: rest_api: test_storage_service: add test_storage_service_keyspace_cleanup_with_no_owned_ranges
compaction_manager: perform_cleanup: handle empty owned ranges
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This patch changes the code to issue raft read
barrier before the pull which will guaranty that the keyspace is created
before the actual schema pull is performed.
Refs: #3760Fixes: #13211
Fixes https://github.com/scylladb/scylladb/issues/13805
This commit fixes the redirection required by moving the Glossary
page from the top of the page tree to the Reference section.
As the change was only merged to master (not to branch-5.2),
it is not working for version 5.2, which is now the latest stable
version.
For this reason, "stable" in the path must be replaced with "master".
Closes#13847
the series drops some of the callers using SSTable generation as integer. as the generation of SSTable is but an identifier, we should not use it as an integer out of generation_type's implementation.
Closes#13845
* github.com:scylladb/scylladb:
test: drop unused helper functions
test: sstable_mutation_test: avoid using helper using generation_type::int_t
test: sstable_move_test: avoid using helper using generation_type::int_t
test: sstable_*test: avoid using helper using generation_type::int_t
test: sstable_3_x_test: do not use reuseable_sst() accepting integer
Updates to the compaction_group sstable sets are
never done in place. Instead, the update is done
on a mutable copy of the sstable set, and the lw_shared
result is set back in the compaction_group.
(see for example compaction_group::set_main_sstables)
Therefore, there's currently a risk in perform_cleanup
`get_sstables` lambda that if it yield while in
set.for_each_sstable, the sstable_set might be replaced
and the copy it is traversing may be destroyed.
This was introduced in c2bf0e0b72.
To prevent that, hold on to set.shared_from_this()
around set.for_each_sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13852
all users of these two helpers have switched to their alternatives,
so there is no need to keep them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer. also, since
`generate_clustered()` is only used by functions in the same
compilation unit, let's take the opportunity to mark it `static`.
and there is no need to pass generation as a template parameter,
we just pass it as a regular parameter.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in helper functions, we just use
`generation_type`. please note, despite that we'd prefer generating
the generations using generator, the SSTables used by the tests
modified by this change are stored in the repo, to ensure that the
tests are always able to find the SSTable files, we keep them
unchanged instead of using generation_generator, or a random
generation for the testing.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type by offering a default paramter, which is a
generation created using 1. this preserves the existing behavior.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type.
also, as no callers are using the last parameter of `make_test_sstable()`,
let's drop it .
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This series fixes an issue with altering permissions on UDFs with
parameter types that are UDTs with quoted names and adds
a test for it.
The issue was caused by the format of the temporary string
that represented the UDT in `auth::resource`. After parsing the
user input to a raw type, we created a string representing the
UDT using `ut_name::to_string()`. The segment of the resulting
string that represented the name of the UDT was not quoted,
making us unable to parse it again when the UDT was being
`prepare`d. Other than for this purpose, the `ut_name::to_string()`
is used only for logging, so the solution was modifying it to
maybe quote the UDT name.
Ref: https://github.com/scylladb/scylladb/pull/12869Closes#13257
* github.com:scylladb/scylladb:
cql-pytest: test permissions for UDTs with quoted names
cql: maybe quote user type name in ut_name::to_string()
cql: add a check for currently used stack in parser
cql-pytest: add an optional name parameter to new_type()
Currently, when creating a UDA, we only check for permissions
for creating functions. However, the creator gains all permissions
to the UDA, including the EXECUTE permission. This enables the
user to also execute the state/reduce/final functions that were
used in the UDA, even if they don't have the EXECUTE permissions
on them.
This patch adds checks for the missing EXECUTE permissions, so
that the UDA can be only created if the user has all required
permissions.
The new permissions that are now required when creating a UDA
are now granted in the existing UDA test.
Fixes#13818Closes#13819
Currently, when a function has no arguments, the function_args()
method, which is supposed to return a vector of string_views
representing the arguments of the function, returns a nullopt
instead, as if it was a functions_resource on all functions
or all functions in a keyspace. As a result, the functions_resource
can't be properly formatted.
This is fixed in this patch by returning an empty vector instead,
and the fix is confirmed in a cql-pytest.
Fixes#13842Closes#13844
data_consume_rows keeps an input_stream member that must be closed.
In particular, on the error path, when we destroy it possibly
with readaheads in flight.
Fixes#13836
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13840
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is current very inefficient as we need
to filter every partition and drop it as unowned.
It can be optimized by either special casing this case
ot, better, use skip forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.
Fixes#13634
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.
This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance
To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the source of the values
are views into a bytes_ostream. However, luckily bytes_ostream and managed_bytes_view
are mostly compatible so with a little effort this can be done.
The series is neutral wrt performance:
before:
```
222118.61 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
224250.14 tps ( 61.1 allocs/op, 12.1 tasks/op, 43094 insns/op, 0 errors)
224115.66 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
223508.70 tps ( 61.1 allocs/op, 12.1 tasks/op, 43107 insns/op, 0 errors)
223498.04 tps ( 61.1 allocs/op, 12.1 tasks/op, 43087 insns/op, 0 errors)
```
after:
```
220708.37 tps ( 61.1 allocs/op, 12.1 tasks/op, 43118 insns/op, 0 errors)
225168.99 tps ( 61.1 allocs/op, 12.1 tasks/op, 43081 insns/op, 0 errors)
222406.00 tps ( 61.1 allocs/op, 12.1 tasks/op, 43088 insns/op, 0 errors)
224608.27 tps ( 61.1 allocs/op, 12.1 tasks/op, 43102 insns/op, 0 errors)
225458.32 tps ( 61.1 allocs/op, 12.1 tasks/op, 43098 insns/op, 0 errors)
```
Though I expect with some more effort we can eliminate some copies.
Closes#13637
* github.com:scylladb/scylladb:
cql3: untyped_result_set: switch to managed_bytes_view as the cell type
cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
cql3: untyped_result_set: always own data
types: abstract_type: add mixed-type versions of compare() and equal()
utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
utils: managed_bytes: add managed_bytes_view::with_linearized()
utils: managed_bytes: mark managed_bytes_view::is_linearized() const
Currently, when a user has permissions on a function/all functions in
keyspace, and the function/keyspace is dropped, the user keeps the
permissions. As a result, when a new function/keyspace is created
with the same name (and signature), they will be able to use it even
if no permissions on it are granted to them.
Simliarly to regular UDFs, the same applies to UDAs.
After this patch, the corresponding permissions on functions are dropped
when a function/keyspace is dropped.
Fixes#13820Closes#13823
Task manager's tasks covering scrub compaction on top,
shard and table level.
For this levels we have common scrub tasks for each scrub
mode since they share code. Scrub modes will be differentiated
on compaction group level.
Closes#13694
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test scrub compaction
compaction: add table_scrub_sstables_compaction_task_impl
compaction: add shard_scrub_sstables_compaction_task_impl
compaction: add scrub_sstables_compaction_task_impl
api: get rid of unnecessary std::optional in scrub
compaction: rename rewrite_sstables_compaction_task_impl
in this series, `data_dictionary::storage_options` is refactored so that each dedicated storage option takes care of itself, instead of putting all the logic into `storage_options`. cleaner this way. as the next step, i will add yet another set of options for the tiered_storage which is backed by the s3_storage and the local filesystem_storage. with this change, we will be able to group the per-option functionalities together by the option thy are designed for, instead of sharding them by the actual function.
Closes#13826
* github.com:scylladb/scylladb:
data_dictionary: define helpers in options
data_dictionary: only define operator== for storage options
The validator classes have their definition in a header located in mutation/, while their implementation is located in a .cc in readers/mutation_reader.cc.
This PR fixes this inconsistency by moving the implementation into mutation/mutation_fragment_stream_validator.cc. The only change is that the validator code gets a new logger instance (but the logger variable itself is left unchanged for now).
Closes#13831
* github.com:scylladb/scylladb:
mutation/mutation_fragment_stream_validator.cc: rename logger
readers,mutation: move mutation_fragment_stream_validator to mutation/
When new nodes are added or existing nodes are deleted, the topology
state machine needs to shunt reads from the old nodes to the new ones.
This happens in the `write_both_read_new` state. The problem is that
previously this state was not handled in any way in `token_metadata` and
the read nodes were only changed when the topology state machine reached
the final 'owned' state.
To handle `write_both_read_new` an additional `interval_map` inside
`token_metadata` is maintained similar to `pending_endpoints`. It maps
the ranges affected by the ongoing topology change operation to replicas
which should be used for reading. When topology state sm reaches the
point when it needs to switch reads to a new topology, it passes
`request_read_new=true` in a call to `update_pending_ranges`. This
forces `update_pending_ranges` to compute the ranges based on new
topology and store them to the `interval_map`. On the data plane, when a
read on coordinator needs to decide which endpoints to use, it first
consults this `interval_map` in `token_metadata`, and only if it doesn't
contain a range for current token it uses normal endpoints from
`effective_replication_map`.
Closes#13376
* github.com:scylladb/scylladb:
storage_proxy, storage_service: use new read endpoints
storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
token_metadata: add unit test for endpoints_for_reading
token_metadata: add endpoints for reading
sequenced_set: add extract_set method
token_metadata_impl: extract maybe_migration_endpoints helper function
token_metadata_impl: introduce migration_info
token_metadata_impl: refactor update_pending_ranges
token_metadata: add unit tests
token_metadata: fix indentation
token_metadata_impl: return unique_ptr from clone functions
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change restores the use of raft::internal::tagged_uint64
for the raft types and adds back an idl definition for
it that is not marked as final, similar to the way
raft::internal::tagged_id extends utils::tagged_uuid.
Fixes#13752Closes#13774
* github.com:scylladb/scylladb:
raft, idl: restore internal::tagged_uint64 type
raft: define term_t as a tagged uint64_t
idl: gossip_digest: include required headers
in this series, we encode the value of generation using UUID to prepare for the UUID generation identifier. simpler this way, as we don't need to have two ways to encode integer or a timeduuid: uuid with a zero timestamp, and a variant. also, add a `from_string()` factory method to convert string to generation to hide the the underlying type of value from generation_type's users.
Closes#13782
* github.com:scylladb/scylladb:
sstable: use generation_type::from_string() to convert from string
sstable: encode int using UUID in generation_type
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the drivers receives some information about the prepared statment, including names of values bound to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`) there's an expectation that the name assigned to this bind marker will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been unified with the code that handles generic function calls, and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being `"partition key token"`, so a change to `token(p1, p2)` broke some things.
This patch sets the name back to `"partition key token"`. To achieve this we detect any restrictions that match the pattern `token(p1, p2, p3) = X` and set the receiver name for X to `"partition key token"`.
Fixes: #13769Closes#13815
* github.com:scylladb/scylladb:
cql-pytest: test that bind marker is partition key token
cql3/prepare_expr: force token() receiver name to be partition key token
in this change,
* instead of using "\d+" to match the generation, use "[^-]",
* let generation_type to convert a string to generation
before this change, we casts the matched string in SSTable file name
to integer and then construct a generation identifier from the integer.
this solution has a strong assumption that the generation is represented
with an integer, we should not encode this assumption in sstable.cc,
instead we'd better let generation_type itself to take care of this. also,
to relax the restriction of regex for matching generation, let's
just use any characters except for the delimeter -- "-".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we already use UUID for encoding an bigint in SSTable registry
table, let's just use the same approach for encoding bigint in generation_type,
to be more consistent, and less repeatings this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We use set_topology_transition_state to set read_new state
in storage_service::topology_state_load
based on _topology_state_machine._topology.tstate.
This triggers update_pending_ranges to compute and store new ranges
for read requests. We use this information in
storage_proxy::get_endpoints_for_reading
when we need to decide which nodes to use for reading.
We are going to use remapped_endpoints_for_reading, we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need - it's used in all read code paths.
From its name, however, this was not obvious.
Also, we add the parameter ks_name as we'll need it
to pass to remapped_endpoints_for_reading.
In this patch we add
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.
Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
instead of dispatching and implementing the per-option handling
right in `storage_option`, define these helpers in the dedicated
option themselves, so `storage_option` is only responsible for
dispatching.
much cleaner this way. this change also makes it easier to add yet
another storage backend.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as the only user of these comparison operators is
`storage_options::can_update_to()`, which just check if the given
`storage_options` is equal to the stored one. so no need to define
the <=> operator.
also, no need to add the `friend` specifier, as the options are plain
struct, all the member variables are public.
make the comparison operator a member function instead of a free
function, as in C++20 comparision operators are symmetric.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The validator classes have their definition in a header located in mutation/,
while their implementation is located in a .cc in readers/mutation_reader.cc.
This patch fixes this inconsistency by moving the implementation into
mutation/mutation_fragment_stream_validator.cc. The only change is that
the validator code gets a new logger instance (but the logger variable itself
is left unchanged for now).
this change extracts the storage class and its derived classes
out into their own source files. for couple reasons:
- for better readability. the sstables.hh is over 1005 lines.
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the sources
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Closes#13785
* github.com:scylladb/scylladb:
sstables: storage: coroutinize idempotent_link_file()
sstables: extract storage out
When preparing a query each bind marker gets a name.
For a query like:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
The bind marker's name should be `"partition key token"`.
Java driver relies on this name, having something else,
like `"token(p1, p2)"` be the name breaks the Java driver.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the drivers receives some information
about the prepared statment, including names of values bound
to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`)
there's an expectation that the name assigned to this bind marker
will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been
unified with the code that handles generic function calls,
and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being
`"partition key token"`, so a change to `token(p1, p2)`
broke some things.
This patch sets the name back to `"partition key token"`.
To achieve this we detect any restrictions that match
the pattern `token(p1, p2, p3) = X` and set the receiver
name for X to `"partition key token"`.
Fixes: #13769
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map.
Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
Now update_pending_ranges is quite complex, mainly
because it tries to act efficiently and update only
the affected intervals. However, it uses the function
abstract_replication_strategy::get_ranges, which calls
calculate_natural_endpoints for every token
in the ring anyway.
Our goal is to start reading from the new replicas for
ranges in write_both_read_new state. In the current
code structure this is quite difficult to do, so
in this commit we first simplify update_pending_ranges.
The main idea of the refactoring is to build a new version
of token_metadata based on all planned changes
(join, bootstrap, replace) and then for each token
range compare the result of calculate_natural_endpoints on
the old token_metadata and on the new one.
Those endpoints that are in the new version and
are not in the old version should be added to the pending_ranges.
The add_mapping function is extracted for the
future - we are going to use it to handle read mappings.
Special care is taken when replacing with the same IP.
The coordinator employs the
get_natural_endpoints_without_node_being_replaced function,
which excludes such endpoints from its result. If we compare
the new (merged) and current token_metadata configurations, such
endpoints will also be absent from pending_endpoints since
they exist in both. To address this, we copy the current
token_metadata and remove these endpoints prior to comparison.
This ensures that nodes being replaced are treated
like those being deleted.
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8Fixes#13752
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All Raft verbs include `dst_id`, the ID of the destination server, but
it isn't checked. `append_entries` will work even if it arrives at
completely the wrong server (but in the same group). It can cause
problems, e.g. in the scenario of replacing a dead node.
This commit adds verifying if `dst_id` matches the server's ID and if it
doesn't, the Raft verb is rejected.
Closes#12179
Testing
---
Testcase and scylla's configuration:
57d3ef14d8
It artificially lengthens the duration of replacing the old node. It
increases the chance of getting the RPC command sent to a replaced node,
by the new node.
In the logs of the node that replaced the old one, we can see logs in
the form:
```
DEBUG <time> [shard 0] raft_group_registry - Got message for server <dst_id>, but my id is <my_id>
```
It indicates that the Raft verb with the wrong `dst_id` was rejected.
This test isn't included in the PR because it doesn't catch any specific error.
Closes#13575
* github.com:scylladb/scylladb:
service/raft: raft_group_registry: Add verification of destination ID
service/raft: raft_group_registry: `handle_raft_rpc` refactor
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh. for couple reasons:
- for better readability. the sstables.hh is over 1005 lines.
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the sources
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Rename rewrite_sstables_compaction_task_impl to
sstables_compaction_task_impl as a new name describes the class
of tasks better. Rewriting sstables is a slightly more fine-grained
type of sstable compaction task then the one needed here.
this series prepares for the UUID based generation by replacing the general `value()` function with the function with more specific name: `as_int()`.
Closes#13796
* github.com:scylladb/scylladb:
test: drop a reusable_sst() variant which accepts int as generation
treewide: replace generation_type::value() with generation_type::as_int()
In addition to the data file itself. Currently validation avoids the
index altogether, using the crawling reader which only relies on the
data file and ignores the index+summary. This is because a corrupt
sstable usually has a corrupt index too and using both at the same time
might hide the corruption. This patch adds targeted validation of the
index, independent of and in addition to the already existing data
validation: it validates the order of index entries as well as whether
the entry points to a complete partition in the data file.
This will usually result in duplicate errors for out-of-order
partitions: one for the data file and one for the index file.
Fixes: #9611Closes#11405
* github.com:scylladb/scylladb:
test/cql-pytest: add test_sstable_validation.py
test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
tools/scylla-sstables: write validation result to stdout
sstables/sstable: validate(): delegate to mx validator for mx sstables
sstables/mx/reader: add mx specific validator
mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter
sstables/mx/reader: template data_consume_rows_context_m on the consumer
sstables/mx/reader: move row_processing_result to namespace scope
sstables/mx/reader: use data_consumer::proceed directly
sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic)
compaction/compaction: remove now unused scrub_validate_mode_validate_reader()
compaction/compaction: move away from scrub_validate_mode_validate_reader()
tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
sstables/sstable: add validate() method
compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
compaction: scrub: use error messages from validator
mutation_fragment_stream_validator: produce error messages in low-level validator
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loops relies on admission checks triggered by the read
releasing resouces, to bring in any queued read into the _ready_list,
while it is executing the current read. But in some cases the current
read might not free any resorces and thus fail to trigger an admission
check and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
Fixes: #13540Closes#13541
The discussion on the thread says, when we reformat a volume with another
filesystem, kernel and libblkid may skip to populate /dev/disk/by-* since it
detected two filesystem signatures, because mkfs.xxx did not cleared previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see previous filesystem signature on /dev/mdX when we construct RAID volume multiple time on same disks).
Also dropped -f option from mkfs.xfs, it will check wipefs is working as we
expected.
Fixes#13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#13738
The test performs an `INSERT` followed by a `SELECT`, checking if the
previously inserted data is returned.
This may fail because we're using `ring_delay = 0` in tests and the two
queries may arrive at different nodes, whose `token_metadata` didn't
converge yet (it's eventually consistent based on gossiping).
I illustrated this here:
https://github.com/scylladb/scylladb/issues/12937#issuecomment-1536147455
Ensure that the nodes' token rings are synchronized (by waiting until
the token ring members on each node is the same as group 0
configuration).
Fixes#12937Closes#13791
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.
However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.
To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.
Fixes#13788Closes#13789
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the source codes, and build Wasm when
necessary. This series adds build commands that
compile C/Rust sources to Wasm and uses them for Wasm
programs that we're already using.
After these changes, adding a new program that should be
compiled to Rust, requires only adding the source code
of it and updating the `wasms` and `wasm_deps` lists in
`configure.py`.
All Wasm programs are build by default when building all
artifacts, artifacts in a given mode, or when building
tests. Additionally, a {mode}-wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/{mode}/wasm,
and are accessed in cql-pytests similarly to the way we're
accessing the scylla binary - using glob.
Closes#13209
* github.com:scylladb/scylladb:
wasm: replace wasm programs with their source programs
build: prepare rules for compiling wasm files
build: set the type of build_artifacts
test: extend capabilities of Wasm reading helper funciton
token_metadata takes token_metadata_impl as unique_ptr,
so it makes sense to create it that way in the first place
to avoid unnecessary moves.
token_metadata_impl constructor with shallow_copy parameter
was made public for std::make_unique. The effective
accessibility of this constructor hasn't changed though since
shallow_copy remains private.
After recent changes, we are able to store only the
C/Rust source codes for Wasm programs, and only build
them when neccessary. This patch utilizes this
opportunity by removing most of the currently stored
raw Wasm programs, replacing them with C/Rust sources
and adding them to the new build system.
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the (C/Rust) source codes, and build Wasm
when neccessary. This patch adds build commands that
compile C/Rust sources to Wasm.
After these changes, adding a new program that should be
compiled to Rust, requires only adding the source code
of it and updating the wasms and wasm_deps lists in
configure.py.
All Wasm programs are build by default when building all
artifacts, all artifacts in a given mode, or when building
tests. Additionally, a ninja wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/wasm.
Currently, build_artifacts are of type set[str] | list, which prevents
us from performing set operations on it. In a future patch, we will
want to take a set difference and set intersections with it, so we
initialize the type of build_artifacts to a set in all cases.
Currently, we require that the Wasm file is named the same
as the funciton. In the future we may want multiple functions
with the same name, which we can't currently do due to this
limitation.
This patch allows specifying the function name, so that multiple
files can have a function with the same name.
Additionally, the helper method now escapes "'" characters, so
that they can appear in future Wasm files.
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Therefore, let's implement it.
Fixes https://github.com/scylladb/scylladb/issues/13649.
Closes#13713
* github.com:scylladb/scylladb:
s3: Provide timestamps in the s3 file implementation
s3: Introduce get_object_stats()
s3: introduce get_object_header()
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Fixes#13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get_object_stats() will be used for retrieving content size and
also last modified.
The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since #13452, we switched most of the caller sites from std::regex
to boost::regex. in this change, all occurences of `#include <regex>`
are dropped unless std::regex is used in the same source file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13765
functinoality wise, `uint64_t_tri_compare()` is identical to the
three-way comparison operator, so no need to keep it. in this change,
it is dropped in favor of <=>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13794
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.
Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.
The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
untyped_result_set is used for internal queries, where ease-of-use is more
important than performance. Currently, cells are held either by value or
by reference (managed_bytes_view). An upcoming change will cause the
result set to be built from managed_bytes_view, making it non-owning, but
the source data is not actually held, resulting in a use-after-free.
Rather than chase the source and force the data to be owned in this case,
just drop the possibility for a non-owning untyped_result_set. It's only
used in non-performance-critical paths and safety is more important than
saving a few cycles.
This also results in simplification: previously, we had a variant selecting
monostate (for NULL), managed_bytes_view (for a reference), and bytes (for
owning data); now we only have a bytes_opt since that already signifies
data-or-NULL.
Once result_set transitions to managed_bytes_opt, untyped_result_set
will follow. For now it's easier to use bytes_opt.
compare() and equal() can compare two unfragmented values or two
fragmented values, but a mix of a fragmented value and an unfragmented
value runs afoul of C++ conversion rules. Add more overloads to
make it simpler for users.
The codebase evolved to have several different ways to hold a fragmented
buffer: fragmented_temporary_buffer (for data received from the network;
not relevant for this discussion); bytes_ostream (for fragmented data that
is built incrementally; also used for a serialized result_set), and
managed_bytes (used for lsa and serialized individual values in
expression evaluation).
One problem with this state of affairs is that using data in one
fragmented form with functions that accept another fragmented form
requires either a copy, or templating everything. The former is
unpalatable for fast-path code, and the latter is undesirable for
compile time and run-time code footprint. So we'd like to make
the various forms compatible.
In 53e0dc7530 ("bytes_ostream: base on managed_bytes") we changed
bytes_ostream to have the same underlying data structure as
managed_bytes, so all that remains is to add the right API. This
is somewhat difficult as the data is hidden in multiple layers:
ser::buffer_view<> is used to abstract a slice of bytes_ostream,
and this is further abstracted by using iterators into bytes_ostream
rather than directly using the internals. Likewise, it's impossible
to construct a managed_bytes_view from the internals.
Hack through all of these by adding extract_implementation() methods,
and a build_managed_bytes_view_from_internals() helper. These are all
used by new APIs buffer_view_to_managed_bytes_view() that extract
the internals and put them back together again.
Ideally we wouldn't need any of this, but unifying the type system
in this area is quite an undertaking, so we need some shortcuts.
Becomes useful in later patches.
To avoid double-compiling the call to func(), use an
immediately-invoked lambda to calculate the bytes_view we'll be
calling func() with.
The code was incorrectly passing a data_value of type bytes due to
implicit conversion of the result of serialize() (bytes_opt) to a
data_value object of type bytes_type via:
data_value(std::optional<NativeType>);
mutation::set_static_cell() accepts a data_value object, which is then
serialized using column's type in abstract_type::decompose(data_value&):
bytes b(bytes::initialized_later(), serialized_size(*this, value._value));
auto i = b.begin();
value.serialize(i);
Notice that serialized_size() is taken from the column type, but
serialization is done using data_value's type. The two types may have
a compatible CQL binary representation, but may differ in native
types. serialized_size() may incorrectly interpret the native type and
come up with the wrong size. If the size is too smaller, we end up with
stack or heap corruption later after serialize().
For example, if the column type is utf8 but value holds bytes, the
size will be wrong because even though both use the basic_sstring
type, they have a different layout due to max_size (15 vs 31).
Fixes#13717Closes#13787
When requesting memory via `reader_permit::request_memory()`, the
requested amount is added to `_requested_memory` member of the permit
impl. This is because multiple concurrent requests may be blocked and
waiting at the same time. When the requests are fulfilled, the entire
amount is consumed and individual requests track their requested amount
with `resource_units` to release later.
There is a corner-case related to this: if a reader permit is registered
as inactive while it is waiting for memory, its active requests are
killed with `std::bad_alloc`, but the `_requested_memory` fields is not
cleared. If the read survives because the killed requests were part of
a non-vital background read-ahead, a later memory request will also
include amount from the failed requests. This extra amount wil not be
released and hence will cause a resource leak when the permit is
destroyed.
Fix by detecting this corner case and clearing the `_requested_memory`
field. Modify the existing unit test for the scenario of a permit
waiting on memory being registered as inactive, to also cover this
corner case, reproducing the bug.
Fixes: #13539Closes#13679
this is one of the changes to reduce the usage of integer based generation
test. in future, we will need to expand the test to exercise the UUID
based generation, or at least to be neutral to the underlying generation's
identifier type. so, to remove the helpers which only accept `generation_type::int_t`
would helps us to make this happen.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()
because we will switch over to UUID based generation identifier, the member
function or the free function generation_value() cannot fulfill the needs
anymore. so, in this change, they are consolidated and are replaced by
"as_int()", whose name is more specific, and will also work and won't be
misleading even after switching to UUID based generation identifier. as
`value()` would be confusing by then: it could be an integer or a UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For some reason Scylla crashes on `aarch64` in release mode when calling
`fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this
line:
```
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id()));
```
I found this in our configure.py:
```
def get_clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
# we see miscompiles with 1200 and above with format("{}", uuid)
# also coroutine miscompiles with 600
return 300
else:
return 2500
```
but reducing it to `0` didn't help.
I managed to get the following backtrace (with inline threshold 0):
```
void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751
(inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728
(inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255
void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const*, wchar_t const*) at ??:?
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:?
fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:?
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206
(inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572
```
Maybe it's a bug in `fmt` library?
In any case replacing the call with `::format` (i.e. `seastar::format`
from seastar/core/print.hh) helps.
Do it for the entire file for consistency (and avoiding this bug).
Also, for the future, replace `format` calls with `::format` - now it's
the same thing, but the latter won't clash with `std::format` once we
switch to libstdc++13.
Fixes#13707Closes#13711
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.
Closes#13427
* github.com:scylladb/scylladb:
pylib_test: add tests for read_last_line
pytest: add pylib_test directory
scylla_cluster.py: fix read_last_line
scylla_cluster.py: move read_last_line to util.py
to avoid the FTBFS after we bump up the Seastar submodule which bumped up its API level to v7. and API v7 is a breaking change. so, in order to unbreak the build, we have to hardwire the API level to 6. `configure.py` also does this.
Closes#13780
* github.com:scylladb/scylladb:
build: cmake: disable deprecated warning
build: cmake: use Seastar API level 6
we introduced the linkage to Boost::unit_test_framework in
fe70333c19, this library is used by
test/lib/test_utils.cc, so update CMake accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13781
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 512 chars;
* the logic of the function made simpler and more
maintainable.
There are two of them currently with slightly different declaration. Better to leave only one.
Closes#13772
* github.com:scylladb/scylladb:
test: Deduplicate test::filename() static overload
test: Make test::filename return fs::path
The method in question suffers from scylladb/seastar#1298. The PR fixes it and makes a bit shorter along the way
Closes#13776
* github.com:scylladb/scylladb:
sstable: Close file at the end
sstables: Use read_entire_stream_cont() helper
This commit changes the configuration in the conf.py
file to make branch-5.2 the latest version and
remove it from the list of unstable versions.
As a result, the docs for version 5.2 will become
the default for users accessing the ScyllaDB Open Source
documentation.
This commit should be merged as soon as version 5.2
is released.
Closes#13681
One is unused, the other one is not really required in public
Closes#13771
* github.com:scylladb/scylladb:
file_writer: Remove static make() helper
sstable: Use toc_filename() to print TOC file path
The test case consists of two internal sub-test-cases. Making them
explicit kills three birds with one stone
- improves parallelizm
- removes env's tempdir wiping
- fixes code indentation
refs: #12707
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13768
since Seastar now deprecates a bunch of APIs which accept io_priority_class,
we started to have deprecated warnings. before migrating to V7 API,
let's disable this warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to avoid the FTBFS after we bump up the Seastar submodule
which bumped up its API level to v7. and API v7 is a breaking
change. so, in order to unbreak the build, we have to hardwire
the API level to 6. `configure.py` also does this.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 02d5a0d7c...f94b1bb9c (12):
> Merge 'Unify CPU scheduling groups and IO priority classes' from Pavel Emelyanov
> scripts: addr2line: relax regular expression for matching kernel traces
> add dirs for clangd to .gitignore
> http::client: Log failed requests' body
> build: always quote the ENVIRONMENT with quotes
> exception_hacks: Change guard check order to work around static init fail
> shared_future: remove support for variadic futures
> iotune: Don't close file that wasn't opened
Fixes#13439
> Merge 'Relax per tick IO grab threshold' from Pavel Emelyanov
> future: simplify constraint on then() a little
> Merge 'coroutine: generator: initialize const member variable and enable generator tests' from Kefu Chai
> future: drop libc++ std::tuple compatibility hack
Closes#13777
The thing is than when closing file input stream the underlying file is
not .close()-d (see scylladb/seastar#1298). The remove_by_toc_name() is
buggy in this sense. Using with_closeable() fixes it and makes the code
shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The remove_by_toc_name() wants to read the whole stream into a sstring.
There's a convenience helper to facilitate that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however only covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date.
Closes#13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
e2c9cdb576 moved the validation of the range tombstone change to the place where it is actually consumed, so we don't attempt to pass purged or discarded range tombstones to the validator. In doing so however, the validate pass was moved after the consume call, which moves the range tombstone change, the validator having been passed a moved-from range tombstone. Fix this by moving he validation to before the consume call.
Refs: #12575Closes#13749
* github.com:scylladb/scylladb:
test/boost/mutation_test: add sanity test for mutation compaction validator
mutation/mutation_compactor: add validation level to compaction state query constructor
mutation/mutation_compactor: validate range tombstone change before it is moved
Current S3 client was tested over minio and it takes few more touches to work with amazon S3.
The main challenge here is to support singed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn is the essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements, one of them is -- request _body_ can be or can be not included into signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. -- request body should be included into signature calculations), which can a bit troublesome, so instead the PR uses unsigned payload (i.e. -- doesn't include the request body into signature calculation, only necessary headers and query parameters), but thus also needs HTTPS.
So what this set does is makes the existing S3 client code sign requests. In order to sign the request the code needs to get AWS key and secret (and region) from somewhere and this somewhere is the conf/object_storage.yaml config file. The signature generating code was previously merged (moved from alternator code) and updated to suit S3 client needs.
In order to properly support HTTPS the PR adds special connection factory to be used with seastar http client. The factory makes DNS resolving of AWS endpoint names and configures gnutls systemtrust.
fixes: #13425Closes#13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT *
FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
```
The reason for the regression is that the call to `do_for_each_sstable` in `collect_all_shared_sstables` to search for sstables that need cleanup caused the list of sstables in the sstable directory to be moved and cleared.
parallel_for_each_restricted moves the container passed to it into a `do_with` continuation. This is required for parallel_for_each_restricted.
However, moving the container is destructive and so, the decision whether to move or not needs to be the caller's, not the callee.
This patch changes the signature of parallel_for_each_restricted to accept a container rather than a rvalue reference, allowing the callers to decide whether to move or not.
Most callers are converted to move the container, except for `do_for_each_sstable` that copies `_unshared_local_sstables`, allowing callers to call `dir.do_for_each_sstable` multiple times without moving the list contents.
Closes#13526
* github.com:scylladb/scylladb:
sstable_directory: coroutinize parallel_for_each_restricted
sstable_directory: parallel_for_each_restricted: use std::ranges for template definition
sstable_directory: parallel_for_each_restricted: do not move container
There are two of them currently, both returning fs::path for sstable
components. One is static and can be dropped, callers are patched to use
the non-static one making the code tiny bit shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::filename() is private and is not supposed to be used as a
path to open any files. However, tests are different and they sometimes
know it is. For that they use test wrapper that has access to private
members and may make assumptions about meaning of sstable::filename().
Said that, the test::filename() should return fs::path, not sstring.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit 1cb95b8cf caused a small regression in the debug printer.
After that commit, range tombstones are printed to stdout,
instead of the target stream.
In practice, this causes range tombstones to appear in test logs
out of order with respect to other parts of the debug message.
Fix that.
Closes#13766
The sstable::write_toc() gets TOC filename from file writer, while it
can get it from itself. This makes the file_writer::get_filename()
private and actually improves logging, as the writer is not required
to have the filename onboard, while sstable always has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All Raft verbs include dst_id, the ID of the destination server, but it isn't checked.
`append_entries` will work even if it arrives at completely the wrong server (but in the same group).
It can cause problems, e.g. in the scenario of replacing a dead node.
This commit adds verifying if `dst_id` matches the server's ID and if it doesn't,
the Raft verb is rejected.
Closes#12179
storage_service uses raft_group0 but the during shutdown the later is
destroyed before the former is stopped. This series move raft_group0
destruction to be after storage_service is stopped already. For the
move to work some existing dependencies of raft_group0 are dropped
since they do not really needed during the object creation.
Fixes#13522
In case an sstable unit test case is run individually, it would fail
with exception saying that S3_... environment is not set. It's better to
skip the test-case rather than fail. If someone wants to run it from
shell, it will have to prepare S3 server (minio/AWS public bucket) and
provide proper environment for the test-case.
refs: #13569
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13755
One-way RPC and two-way RPC have different semantics, i.e. in the first one
client doesn't need to wait for an answer.
This commit splits the logic of `handle_raft_rpc` to enable handle differences
in semantics, e.g. errors handling.
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times, it only guarantees the
same order of events on all replicas.
When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.
To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.
Fixes#13518 (flaky topology test).
Closes#13756
raft_group0 does not really depends on cdc::generation_service, it needs
it only transiently, so pass it to appropriate methods of raft_group0
instead of during its creation.
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT * FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
E -1
```
The reason for the regression is that the call to `do_for_each_sstable`
in `collect_all_shared_sstables` to search for sstables that need
cleanup caused the list of sstables in the sstable directory to be
moved and cleared.
parallel_for_each_restricted moves the container passed to it
into a `do_with` continuation. This is required for
parallel_for_each_restricted.
However, moving the container is destructive and so,
the decision whether to move or not needs to be the
caller's, not the callee.
This patch changes the signature of parallel_for_each_restricted
to accept a lvalue reference to the container rather than a rvalue reference,
allowing the callers to decide whether to move or not.
Most callers are converted to move the container, as they effectively do
today, and a new method, `filter_sstables` was added for the
`collect_all_shared_sstables` us case, that allows the `func` that
processes each sstable to decide whether the sstable is kept
in `_unshared_local_sstables` or not.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the validate command uses the logger to output the result of
validation. This is inconsistent with other commands which all write
their output to stdout and log any additional information/errors to
stderr. This patch updates the validate command to do the same. While at
it, remove the "Validating..." message, it is not useful.
We have a more in-depth validator for the mx format, so delegate to that
if the validated sstable is of that format. For kl/la we fall-back to
the reader-level validator we used before.
Working with the low-level sstable parser and index reader, this
validator also cross-checks the index with the data file, making sure
all partitions are located at the position and in the order the index
describes. Furthermore, if the index also has promoted index, the order
and position of clustering elements is checked against it.
This is above the usual fragment kind order, partition key order and
clustering order checks that we already had with the reader-level
validator.
it turns out the only places where we have compiler warnings of -W-parentheses-equality is the source code generated by ANTLR. strictly speaking, this is valid C++ code, just not quite readable from the hygienic point of view. so let's enable this warning in the source tree, but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang, so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
Closes#13762
* github.com:scylladb/scylladb:
build: only apply -Wno-parentheses-equality to ANTLR generated sources
compaction: disambiguate format_to()
it turns out the only places where we have compiler warnings of
-W-parentheses-equality is the source code generated by ANTLR. strictly
speaking, this is valid C++ code, just not quite readable from the
hygienic point of view. so let's enable this warning in the source tree,
but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang,
so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should always qualify `format_to` with its namespace. otherwise
we'd have following failure when compiling with libstdc++ from GCC-13:
```
/home/kefu/dev/scylladb/compaction/table_state.hh:65:16: error: call to 'format_to' is ambiguous
return format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
^~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13760
Support the AWS_S3_EXTRA environment vairable that's :-split and the
respective substrings are set as endpoint AWS configuration. This makes
it possible to run boost S3 test over real S3.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the endpoint config specifies AWS key, secret and region, all the
S3 requests get signed. Signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes
x-ams-date, x-amz-content-sha256 and host headers into the signing list.
The content can be unsigned when sent over HTTPS, this is what this
patch does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Existing seastar's factories work on socket_address, but in S3 we have
endpoint name which's a DNS name in case of real S3. So this patch
creates the http client for S3 with the custom connection factory that
does two things.
First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).
Since s3 client creation is no-waiting code currently, the above
initialization is spawned in afiber and before creating the connection
this fiber is waited upon.
This code probably deserves living in seastar, but for now it can land
next to utils/s3/client.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code temporarily assumes that the endpoint port is 9000.
This is what tests' local minio is started with. This patch keeps the
port number on endpoint config and makes test get the port number from
minio starting code via environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch -- extent the s3::client constructor to get
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are yet unused too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the client is constructed with socket_address which's prepared
by the caller from the endpoint string. That's not flexible engouh,
because s3 client needs to know the original endpoint string for two
reasons.
First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.
So this patch just relaxes the client constructor to accept the endpoint
string and hard-code the 9000 port. The latter is temporary, this is how
local tests' minio is started, but next patch will make it configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch. The sstables::s3_storage gets the
endpoint config instance upon creation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user sstables manager will need to provide endpoint config for
sstables' storage drivers. For that it needs to get it from db::config
and keep in-sync with its updates.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart-uploading is banned for unsigned
requests on the S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so there's an option to use S3
URLs without region name in them we could want to use some time.
To keep the described configuration the proposed place is the
object_storage.yaml file with the format
endpoints:
- name: a.b.c
port: 443
aws_key: 12345
aws_secret: abcdefghijklmnop
...
When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Allowing the validation level to be customized by whoever creates the
compaction state. Add a default value (the previous hardcoded level) to
avoid the churn of updating all call sites.
e2c9cdb576 moved the validation of the
range tombstone change to the place where it is actually consumed, so we
don't attempt to pass purged or discarded range tombstones to the
validator. In doing so however, the validate pass was moved after the
consume call, which moves the range tombstone change, the validator
having been passed a moved-from range tombstone. Fix this by moving he
validation to before the consume call.
Refs: #12575
in this series, we try to use `generation_type` as a proxy to hide the consumers from its underlying type. this paves the road to the UUID based generation identifier. as by then, we cannot assume the type of the `value()` without asking `generation_type` first. better off leaving all the formatting and conversions to the `generation_type`. also, this series changes the "generation" column of sstable registry table to "uuid", and convert the value of it to the original generation_type when necessary, this paves the road to a world with UUID based generation id.
Closes#13652
* github.com:scylladb/scylladb:
db: use uuid for the generation column in sstable registry table
db, sstable: add operator data_value() for generation_type
db, sstable: print generation instead of its value
Currently there are only 2 tests for S3 -- the pure client test and compound object_store test that launches scylla, creates s3-backed table and CQL-queries it. At the same time there's a whole lot of small unit test for sstables functionality, part of it can run over S3 storage too.
This PR adds this support and patches several test cases to use it. More test cases are to come later on demand.
fixes: #13015Closes#13569
* github.com:scylladb/scylladb:
test: Make resharding test run over s3 too
test: Add lambda to fetch bloom filter size
test: Tune resharding test use of sstable::test_env
test: Make datafile test case run over s3 too
test: Propagate storage options to table_for_test
test: Add support for s3 storage_options in config
test: Outline sstables::test_env::do_with_async()
test: Keep storage options on sstable_test_env config
sstables: Add and call storage::destroy()
sstables: Coroutinize sstable::destroy()
Currently mp_row_consumer_m creates an alias to data_consumer::proceed.
Code in the rest of the file uses both unqualified name and
mp_row_consumer_m::proceed. Remove the alias and just use
`data_consumer::proceed` directly everywhere, leads to cleaner code.
Test sstable::validate() instead. Also rename the unit test testing said
method from scrub_validate_mode_validate_reader_test to
sstable_validate_test to reflect the change.
At this point this test should probably be moved to
sstable_datafile_test.cc, but not in this patch.
Sadly this transition means we loose some test scenarios. Since now we
have to write the invalid data to sstables, we have to drop scenarios
which trigger errors on either the write or read path.
To replace the validate code currently in compaction/compaction.cc (not
in this commit). We want to push down this logic to the sstable layer,
so that:
* Non compaction code that wishes to validate sstables (tests, tools)
doesn't have to go through compaction.
* We can abstract how sstables are validated, in particular we want to
add a new more low-level validation method that only the more recent
sstable versions (mx) will support.
Currently said method creates a combined reader from all the sstables
passed to it then validates this combined reader.
Change it to validate each sstable (reader) individually in preparation
of the new validate method which can handle a single sstable at a time.
Note that this is not going to make much impact in practice, all callers
pass a single sstable to this method already.
Currently, error messages for validation errors are produced in several
places:
* the high-level validator (which is built on the low-level one)
* scrub compaction and validation compaction (scrub in validate mode)
* scylla-sstable's validate operation
We plan to introduce yet another place which would use the low-level
validator and hence would have to produce its own error messages. To cut
down all this duplication, centralize the production of error messages
in the low-level validator, which now returns a `validation_result`
object instead of bool from its validate methods. This object can be
converted to bool (so its backwards compatible) and also contains an
error message if validation failed. In the next patches we will migrate
all users of the low level validator (be that direct or indirect) to use
the error messages provided in this result object instead of coming up
with one themselves.
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guranteeing this forward change has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is garanteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491Closes#13563
This mini-series cleans up printing of ranges in utils/to_string.hh
It generalizes the helper function to work on a std::ranges::range,
with some exceptions, and adds a helper for boost::transformed_range.
It also changes the internal interface by moving `join` the the utils namespace
and use std::string rather than seastar::sstring.
Additional unit tests were added to test/boost/json_test
Fixes#13146Closes#13159
* github.com:scylladb/scylladb:
utils: to_string: get rid of utils::join
utils: to_string: get rid of to_string(std::initializer_list)
utils: to_string: get rid of to_string(const Range&)
utils: to_string: generalize range helpers
test: add string_format_test
utils: chunked_vector: add std::ranges::range ctor
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digitst are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a huge number, huge allocations and
stalls. This is highly undesirable.
This series adds more tests in this area covering additional corner cases,
and then fixes the issue by adding the missing verification where it's
needed. After the series, all 12 tests in test/alternator/test_number.py now pass.
Fixes#6794Closes#13743
* github.com:scylladb/scylladb:
alternator: unit test for number magnitude and precision function
alternator: add validation of numbers' magnitude and precision
test/alternator: more tests for limits on number precision and magnitude
test/alternator: reproducer for DoS in unlimited-precision addition
* change the "generation" column of sstable registry table from
bigint to uuid
* from helper to convert UUID back to the original generation
in the long run, we encourage user to use uuid based generation
identifier. but in the transition period, both bigint based and uuid
based identifiers are used for the generation. so to cater both
needs, we use a hackish way to store the integer into UUID. to
differentiate the was-integer UUID from the geniune UUID, we
check the UUID's most_significant_bits. because we only support
serialize UUID v1, so if the timestamp in the UUID is zero,
we assume the UUID was generated from an integer when converting it
back to a generation identififer.
also, please note, the only use case of using generation as a
column is the sstable_registry table, but since its schema is fixed,
we cannot store both a bigint and a UUID as the value of its
`generation` column, the simpler way forward is to use a single type
for the generation. to be more efficient and to preserve the type of
the generation, instead of using types like ascii string or bytes,
we will always store the generation as a UUID in this table, if the
generation's identifier is a int64_t, the value of the integer will
be used as the least significant bits of the UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cql-pytest
framework.
This test file checks various LWT conditional updates which involve
collections or UDTs (there is a separate test file for LWT conditional
updates which do not involve collections, which I haven't translated
yet).
The tests reproduce one known bug:
Refs #5855: lwt: comparing NULL collection with empty value in IF
condition yields incorrect results
And also uncovered three previously-unknown bugs:
Refs #13586: Add support for CONTAINS and CONTAINS KEY in LWT expressions
Refs #13624: Add support for UDT subfields in LWT expression
Refs #13657: Misformatted printout of column name in LWT error message
Beyond those bona-fide bugs, this test also demonstrates several places
where we intentionally deviated from Cassandra's behavior, forcing me
to comment out several checks. These deviations are known, and intentional,
but some of them are undocumented and it's worth listing here the ones
re-discovered by this test:
1. On a successful conditional write, Cassandra returns just True, Scylla
also returns the old contents of the row. This difference is officially
documented in docs/kb/lwt-differences.rst.
2. Scylla allows the test "l = [null]" or "s = {null}" with this weird
null element (the result is false), whereas Cassandra prints an error.
3. Scylla allows "l[null]" or "m[null]" (resulting in null), Cassandra
prints an error.
4. Scylla allows a negative list index, "l[-2]", resulting in null.
Cassandra prints an error in this case.
5. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
6. Scylla allows "IN null" (the condition just fails), Cassandra prints
an error in this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13663
Now when the test case and used lib/utils code is using storage-agnostic
approach, it can be extended to run over S3 storage as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The resharding test compares bloom filter sizes before and after reshard
runs. For that it gets the filter on-disk filename and stat()s it. That
won't work with S3 as it doesn't have its accessable on-disk files.
Some time ago there existed the storage::get_stats() method, but now
it's gone. The new s3::client::get_object_stat() is coming, but it will
take time to switch to it. For now, generalize filter size fetching into
a local lambda. Next patch will make a stub in it for S3 case, and once
the get_object_stat() is there we'll be able to smoothly start using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
The test case in question spawns async context then makes the test_env
instance on the stack (and stopper for it too). There's helper for the
above steps, better to use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the sstable_datafile test cases are capable of running with S3
storage, so this patch makes the simplest of them do it. Patching the
rest from this file is optional, because mostly the cases test how the
datafile data manipulations work without checking the files
manipulations. So even if making them all run over S3 is possible, it
will just increase the testing time w/o real test of the storage driver.
So this patch makes one test case run over local and S3 storages, more
patches to update more test cases with files manipulations are yet to
come.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach table_for_tests use any storage options, not just local one. For
now the only user that passes non-local options is sstables::test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable test case wants to run over S3 storage it needs to
specify that in test config by providing the S3 storage options. So
first thing this patch adds is the helper that makes these options based
on the env left by minio launcher from test.py.
Next, in order to make sstables_manager work with S3 it needs the
plugged system keyspace which, in turn, needs query processor, proxy,
database, etc. All this stuff lives in cql_test_env, so the test case
running with S3 options will run in a sstables::test_env nested inside
cql_test_env. The latter would also need to plug its system keyspace to
the former's sstables manager and turn the experimental feature ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We have known for a long time (see issue #1703) that the quality of our
CQL "syntax error" messages leave a lot to be desired, especially when
compared to Cassandra. This patch doesn't yet bring us great error
messages with great context - doing this isn't easy and it appears that
Antlr3's C++ runtime isn't as good as the Java one in this regard -
but this patch at least fixes **garbage** printed in some error messages.
Specifically, when the parser can deduce that a specific token is missing,
it used to print
line 1:83 missing ')' at '<missing '
After this patch we get rid of the meaningless string '<missing ':
line 1:83 : Missing ')'
Also, when the parser deduced that a specific token was unneeded, it
used to print:
line 1:83 extraneous input ')' expecting <invalid>
Now we got rid of this silly "<invalid>" and write just:
line 1:83 : Unexpected ')'
Refs #1703. I didn't yet marked that issue "fixed" because I think a
complete fix would also require printing the entire misparsed line and the
point of the parse failure. Scylla still prints a generic "Syntax Error"
in most cases now, and although the character number (83 in the above
example) can help, it's much more useful to see the actual failed
statement and where character 83 is.
Unfortunately some tests enshrine buggy error messages and had to be
fixed. Other tests enshrined strange text for a generic unexplained
error message, which used to say " : syntax error..." (note the two
spaces and elipses) and after this patch is " : Syntax error". So
these tests are changed. Another message, "no viable alternative at
input" is deliberately kept unchanged by this patch so as not to break
many more tests which enshrined this message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13731
So that it could be set to s3 by the test case on demand. Default is
local storage which uses env's tempdir or explicit path argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The s3_storage leaks client when sstable gets destoryed. So far this
came unnoticed, but debug-mode unit test ran over minio captured it. So
here's the fix.
When sstable is destroyed it also kicks the storage to do whatever
cleanup is needed. In case of s3 storage the cleanup is in closing the
on-boarded client. Until #13458 is fixed each sstable has its own
private version of the client and there's no other place where it can be
close()d in co_await-able mannter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the previous patch we added a limit in Alternator for the magnitude
and precision of numbers, based on a function get_magnitude_and_precision
whose implementation was, unfortunately, rather elaborate and delicate.
Although we did add in the previous patches some end-to-end tests which
confirmed that the final decision made based on this function, to accept or
reject numbers, was a correct decision in a few cases, such an elaborate
function deserves a separate unit test for checking just that function
in isolation. In fact, this unit tests uncovered some bugs in the first
implementation of get_magnitude_and_precision() which the other tests
missed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digitst are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a huge number, huge allocations and
stalls. This is highly undesirable.
After this patch, all tests in test/alternator/test_number.py now
pass. The various failing tests which verify magnitude and precision
limitations in different places (key attributes, non-key attributes,
and arithmetic expressions) now pass - so their "xfail" tags are removed.
Fixes#6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have xfailing tests for issue #6794 - the missing checks on
precision and magnitudes of numbers in Alternator - but this patch adds
checks for additional corner cases. In particular we check the case that
numbers are used in a *key* column, which goes to a different code path
than numbers used in non-key columns, so it's worth testing as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As already noted in issue #6794, whereas DynamoDB limits the magnitude
of numbers to between 10^-130 and 10^125, Scylla does not. In this patch
we add yet another test for this problem, but unlike previous tests
which just shown too much magnitude being allowed which always sounded
like a benign problem - the test in this patch shows that this "feature"
can be used to DoS Scylla - a user user can send a short request that
causes arbitrarily-large allocations, stalls and CPU usage.
The test is currently marked "skip" because it cause cause Scylla to
take a very long time and/or run out of memory. It passes on DynamoDB
because the excessive magnitude is simply not allowed there.
Refs #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It's unused.
Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).
Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.
This doesn't break anything since to_string(initializer_list)
wasn't used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kind of containers.
Modernize the implementation using templates based
on std::ranges::range and using fmt::join.
Extend unit test for formatting different types of ranges,
boost::transformed ranges, deque.
Fixes#13146
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when altering permissions on a functions resource, we
only check if it's a builtin function and not if it's all functions
in the "system" keyspace, which contains all builtin functions.
This patch adds a check of whether the function resource keyspace
is "system". This check actually covers both "single function"
and "all functions in keyspace" cases, so the additional check
for single functions is removed.
Closes#13596
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
Fixes#13466
Also, add an group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
Fixes#13467Closes#13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
This reverts commit d85af3dca4. It
restores the linear search algorithm, as we expect the search to
terminate near the origin. In this case linear search is O(1)
while binary search is O(log n).
A comment is added so we don't repeat the mistake.
Closes#13704
No point in going through the vector<mutation> entry-point
just to discover in run time that it was called
with a single-element vector, when we know that
in advance.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13733
this is a part of a series to migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `cdc::generation_id` and `db_clock::time_point` without the help of `operator<<`.
the formatter of `cdc::generation_id` uses that of `db_clock::time_point` , so these two commits are posted together in a single pull request.
the corresponding `operator<<()` is removed in this change, as all its callers are now using fmtlib for formatting now.
Refs #13245Closes#13703
* github.com:scylladb/scylladb:
db_clock: specialize fmt::formatter<db_clock::time_point>
cdc: generation: specialize fmt::formatter<generation_id>
This series fixes a few issues caused by f1bbf705f9
(f1bbf705f9):
- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
- Fixes#13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
- allow the filtering reader now used during resharding to process tokens owned by other shards
Closes#13635
* github.com:scylladb/scylladb:
compaction: make_partition_filter: do not assert shard ownership
distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
table, compaction_manager: prevent cross shard access to owned_ranges_ptr
this series silences the warnings from GCC 13. some of these changes are considered as critical fixes, and posted separately.
see also #13243Closes#13723
* github.com:scylladb/scylladb:
cdc: initialize an optional using its value type
compaction: disambiguate type name
db: schema_tables: drop unused variable
reader_concurrency_semaphore: fix signed/unsigned comparision
locator: topology: disambiguate type names
raft: disambiguate promise name in raft::awaited_conf_changes
when compiling using GCC-13, it warns that:
```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
224 | sstring parse_multipart_upload_id(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
```
so it turns out that `rapidxml::xml_document<>` could be very large,
let's allocate it on heap instead of on the stack to address this issue.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13722
`data_dictionary::database::find_keyspace()` returns a temporary
object, and `data_dictionary::keyspace::user_types()` returns a
references pointing to a member of this temporary object. so we
cannot use the reference after the expression is evaluated. in
this change, we capture the return value of `find_keyspace()` using
universal reference, and keep the return value of `user_types()`
with a reference, to ensure us that we can use it later.
this change silences the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/statements/authorization_statement.cc:68:21: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
68 | const auto& utm = qp.db().find_keyspace(*keyspace).user_types();
| ^~~
```
Fixes#13725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13726
`current_scheduling_group()` returns a temporary value, and `name()`
returns a reference, so we cannot capture the return value by reference,
and use the reference after this expression is evaluated. this would
cause undefined behavior. so let's just capture it by value.
this change also silence following warning from GCC-13:
```
/home/kefu/dev/scylladb/transport/server.cc:204:11: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
204 | auto& cur_sg_name = current_scheduling_group().name();
| ^~~~~~~~~~~
/home/kefu/dev/scylladb/transport/server.cc:204:56: note: the temporary was destroyed at the end of the full expression ‘seastar::current_scheduling_group().seastar::scheduling_group::name()’
204 | auto& cur_sg_name = current_scheduling_group().name();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13724
`data_dictionary::database::find_column_family()` return a temporary value,
and `data_dictionary::table::get_index_manager()` returns a reference in
this temporary value, so we cannot capture this reference and use it after
the expression is evaluated. in this change, we keep the return value
of `find_column_family()` by value, to extend the lifecycle of the return
value of `get_index_manager()`.
this should address the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/restrictions/statement_restrictions.cc:519:15: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
519 | auto& sim = db.find_column_family(_schema).get_index_manager();
| ^~~
```
Fixes#13727
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13728
Let's remove `expr::token` and replace all of its functionality with `expr::function_call`.
`expr::token` is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`, this will be internally represented as an expression which uses `expr::token` to represent the `token(p1, p2)` part.
The situation with `expr::token` is a bit complicated.
On one hand side it's supposed to represent the partition token, but sometimes it's also assumed that it can represent a generic call to the `token()` function, for example `token(1, 2, 3)` could be a `function_call`, but it could also be `expr::token`.
The query planning code assumes that each occurence of expr::token
represents the partition token without checking the arguments.
Because of this allowing `token(1, 2, 3)` to be represented as `expr::token` is dangerous - the query planning might think that it is `token(p1, p2, p3)` and plan the query based on this, which would be wrong.
Currently `expr::token` is created only in one specific case.
When the parser detects that the user typed in a restriction which has a call to `token` on the LHS it generates `expr::token`.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token, it stays a `function_call`. During preparation there is no check to see if a `function_call` to `token` could be turned into `expr::token`. This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented as `expr::token` and the query planner handles that, but sometimes it might be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of code duplication between a `function_call` and `expr::token`.
All of the evaluation and preparation is the same for `expr::token` as it's for a `function_call` to the token function.
Currently it's impossible to evaluate `expr::token` and preparation has some flaws, but implementing it would basically consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries.
With `expr::token` we turn a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple tables? The schema is different, so something that is represented as `expr::token` for one schema would be represented as `function_call` in the context of a different schema.
Translating expressions to different tables would require careful manipulation to convert `expr::token` to `function_call` and vice versa. This could cause trouble for index queries.
Overall I think it would be best to remove `expr::token`.
Although having a clear marker for the partition token is sometimes nice for query planning, in my opinion the pros are outweighted by the cons.
I'm a big fan of having a single way to represent things, having two separate representations of the same thing without clear boundaries between them causes trouble.
Instead of having both `expr::token` and `function_call` we can just have the `function_call` and check if it represents a partition token when needed.
Refs: #12906
Refs: #12677Closes: #12905Closes#13480
* github.com:scylladb/scylladb:
cql3: remove expr::token
cql3: keep a schema in visitor for extract_clustering_prefix_restrictions
cql3: keep a schema inside the visitor for extract_partition_range
cql3/prepare_expr: make get_lhs_receiver handle any function_call
cql3/expr: properly print token function_call
expr_test: use unresolved_identifier when creating token
cql3/expr: split possible_lhs_values into column and token variants
cql3/expr: fix error message in possible_lhs_values
cql3: expr: reimplement is_satisfied_by() in terms of evaluate()
cql3/expr: add a schema argument to expr::replace_token
cql3/expr: add a comment for expr::has_partition_token
cql3/expr: add a schema argument to expr::has_token
cql3: use statement_restrictions::has_token_restrictions() wherever possible
cql3/expr: add expr::is_partition_token_for_schema
cql3/expr: add expr::is_token_function
cql3/expr: implement preparing function_call without a receiver
cql3/functions: make column family argument optional in functions::get
cql3/expr: make it possible to prepare expr::constant
cql3/expr: implement test_assignment for column_value
cql3/expr: implement test_assignment for expr::constant
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed into `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.
This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.
Closes#13690
* github.com:scylladb/scylladb:
raft topology: rename `update_replica_state` -> `update_topology_state`
raft topology: remove `transition_state::normal`
raft topology: switch on `transition_state` first
raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
raft topology: parse replaced node in `exec_global_command`
raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
storage_service: extract raft topology coordinator fiber to separate class
raft topology: rename `replication_state` to `transition_state`
raft topology: make `replication_state` a topology-global state
as this syntax is not supported by the standard, it seems clang
just silently construct the value with the initializer list and
calls the operator=, but GCC complains:
```
/home/kefu/dev/scylladb/cdc/split.cc:392:54: error: converting to ‘std::optional<partition_deletion>’ from initializer list would use explicit constructor ‘constexpr std::optional<_Tp>::optional(_Up&&) [with _Up = const tombstone&; typename std::enable_if<__and_v<std::__not_<std::is_same<std::optional<_Tp>, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::__not_<std::is_same<std::in_place_t, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::is_constructible<_Tp, _Up>, std::__not_<std::is_convertible<_Iter, _Iterator> > >, bool>::type <anonymous> = false; _Tp = partition_deletion]’
392 | _result[t.timestamp].partition_deletions = {t};
| ^
```
to silences the error, and to be more standard compliant,
let's use emplace() instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Let's remove expr::token and replace all of its functionality with expr::function_call.
expr::token is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`,
this will be internally represented as an expression which uses
expr::token to represent the `token(p1, p2)` part.
The situation with expr::token is a bit complicated.
On one hand side it's supposed to represent the partition token,
but sometimes it's also assumed that it can represent a generic
call to the token() function, for example `token(1, 2, 3)` could
be a function_call, but it could also be expr::token.
The query planning code assumes that each occurence of expr::token
represents the partition token without checking the arguments.
Because of this allowing `token(1, 2, 3)` to be represented
as expr::token is dangerous - the query planning
might think that it is `token(p1, p2, p3)` and plan the query
based on this, which would be wrong.
Currently expr::token is created only in one specific case.
When the parser detects that the user typed in a restriction
which has a call to `token` on the LHS it generates expr::token.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token,
it stays a `function_call`. During preparation there is no check
to see if a `function_call` to `token` could be turned into `expr::token`.
This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented
as `expr::token` and the query planner handles that, but sometimes it might
be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of duplication
between a `function_call` and `expr::token`. All of the evaluation
and preparation is the same for `expr::token` as it's for a `function_call`
to the token function. Currently it's impossible to evaluate `expr::token`
and preparation has some flaws, but implementing it would basically
consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries. With `expr::token` we turn
a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple
tables? The schema is different, so something that is representad
as `expr::token` for one schema would be represented as `function_call`
in the context of a different schema.
Translating expressions to different tables would require careful
manipulation to convert `expr::token` to `function_call` and vice versa.
This could cause trouble for index queries.
Overall I think it would be best to remove expr::token.
Although having a clear marker for the partition token
is sometimes nice for query planning, in my opinion
the pros are outweighted by the cons.
I'm a big fan of having a single way to represent things,
having two separate representations of the same thing
without clear boundaries between them causes trouble.
Instead of having expr::token and function_call we can
just have the function_call and check if it represents
a partition token when needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema arguments.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema arguments.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
get_lhs_receiver looks at the prepared LHS of a binary operator
and creates a receiver corresponding to this LHS expression.
This receiver is later used to prepare the RHS of the binary operator.
It's able to handle a few expression types - the ones that are currently
allowed to be on the LHS.
One of those types is `expr::token`, to handle restrictions like `token(p1, p2) = 3`.
Soon token will be replaced by `expr::function_call`, so the function will need
to handle `function_calls` to the token function.
Although we expect there to be only calls to the `token()` function,
as other functions are not allowed on the LHS, it can be made generic
over all function calls, which will help in future grammar extensions.
The functions call that it can currently get are calls to the token function,
but they're not validated yet, so it could also be something like `token(pk, pk, ck)`.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Printing for function_call is a bit strange.
When printing an unprepared function it prints
the name and then the arguments.
For prepared function it prints <anonymous function>
as the name and then the arguments.
Prepared functions have a name() method, but printing
doesn't use it, maybe not all functions have a valid name(?).
The token() function will soon be represent as a function_call
and it should be printable in a user-readable way.
Let's add an if which prints `token(arg1, arg2)`
instead of `<anonymous function>(arg1, arg2)` when printing
a call to the token function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
One test for expr::token uses raw column identifier
in the test.
Let's change it to unresloved_identifier, which is
a standard representation of unresolved column
names in expressions.
Once expr::token is removed it will be possible
to create a function_call with unresolved_identifiers
as arguments.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The possible_lhs_values takes an expression and a column
and finds all possible values for the column that make
the expression true.
Apart from finding column values it's also capable of finding
all matching values for the partition key token.
When a nullptr column is passed, possible_lhs_values switches
into token values mode and finds all values for the token.
This interface isn't ideal.
It's confusing to pass a nullptr column when one wants to
find values for the token. It would be better to have a flag,
or just have a separate function.
Additionally in the future expr::token will be removed
and we will use expr::is_partition_token_for_schema
to find all occurences of the partition token.
expr::is_partition_token_for_schema takes a schema
as an argument, which possible_lhs_values doesn't have,
so it would have to be extended to get the schema from
somewhere.
To fix these two problems let's split possible_lhs_values
into two functions - one that finds possible values for a column,
which doesn't require a schema, and one that finds possible values
for the partition token and requires a schema:
value_set possible_column_values(const column_definition* col, const expression& e, const query_options& options);
value_set possible_partition_token_values(const expression& e, const query_options& options, const schema& table_schema);
This will make the interface cleaner and enable smooth transition
once expr::token is removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In possible_lhs_values there was a message talking
about is_satisifed_by. It looks like a badly
copy-pasted message.
Change it to possibel_lhs_values as it should be.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Just like has_token, replace_token will use
expr::is_partition_token_for_schema to find all instance
of the partition token to replace.
Let's prepare for this change by adding a schema argument
to the function before making the big change.
It's unsued at the moment, but having a separate commit
should make it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In the future expr::token will be removed and checking
whether there is a partition token inside an expression
will be done using expr::is_partition_token_for_schema.
This function takes a schema as an argument,
so all functions that will call it also need
to get the schema from somewhere.
Right now it's an unused argument, but in the future
it will be used. Adding it in a separate commit
makes it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The statement_restrictions class has a method called has_token_restriction().
This method checks whether the partition key restrictions contain expr::token.
Let's use this function in all applicable places instead of manually calling has_token().
In the future has_token() will have an additional schema argument,
so eliminating calls to has_token() will simplify the transition.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function to check whether the expression
represents a partition token - that is a call
to the token function with consecutive partition
key columns as the arguments.
For example for `token(p1, p2, p3)` this function
would return `true`, but for `token(1, 2, 3)` or `token(p3, p2, p1)`
the result would be `false`.
The function has a schema argument because a schema is required
to get the list of partition columns that should be passed as
arguments to token().
Maybe it would be possible to infer the schema from the information
given earlier during prepare_expression, but it would be complicated
and a bit dangerous to do this. Sometimes we operate on multiple tables
and the schema is needed to differentiate between them - a token() call
can represent the base table's partition token, but for an index table
this is just a normal function call, not the partition token.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that can be used to check
whether a given expression represents a call
to the token() function.
Note that a call to token() doesn't mean
that the expression represents a partition
token - it could be something like token(1, 2, 3),
just a normal function_call.
The code for checking has been taken from functions::get.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently trying to do prepare_expression(function_call)
with a nullptr receiver fails.
It should be possible to prepare function calls without
a known receiver.
When the user types in: `token(1, 2, 3)`
the code should be able to figure out that
they are looking for a function with name `token`,
which takes 3 integers as arguments.
In order to support that we need to prepare
all arguments that can be prepared before
attempting to find a function.
Prepared expressions have a known type,
which helps to find the right function
for the given arguments.
Additionally the current code for finding
a function requires all arguments to be
assignment_testable, which requires to prepare
some expression types, e.g column_values.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The method `functions::get` is used to get the `functions::function` object
of the CQL function called using `expr::function_call`.
Until now `functions::get` required the caller to pass both the keyspace
and the column family.
The keyspace argument is always needed, as every CQL function belongs
to some keyspace, but the column family isn't used in most cases.
The only case where having the column family is really required
is the `token()` function. Each variant of the `token()` function
belongs to some table, as the arguments to the function are the
consecutive partition key columns.
Let's make the column family argument optional. In most cases
the function will work without information about column family.
In case of the `token()` function there's gonna be a check
and it will throw an exception if the argument is nullopt.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this also silence the warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable]
1489 | auto ts = db_clock::now();
| ^~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
a signed/unsigned comparsion can overflow. and GCC-13 rightly points
this out. so let's use `std::cmp_greater_equal()` when comparing
unsigned and signed for greater-or-equal.
```
/home/kefu/dev/scylladb/reader_concurrency_semaphore.cc:931:76: error: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
931 | if (_resources.memory <= 0 && (consumed_resources().memory + r.memory) >= get_kill_limit()) [[unlikely]] {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change silences the warning of `-Wmissing-braces` from
clang. in general, we can initialize an object without constructor
with braces. this is called aggregate initialization.
but the standard does allow us to initialize each element using
either copy-initialization or direct-initialization. but in our case,
neither of them applies, so the clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change,
also, take the opportunity to use structured binding to simplify the
related code.
Closes#13705
* github.com:scylladb/scylladb:
build: reenable -Wmissing-braces
treewide: add braces around subobject
cql3/stats: use zero-initialization
S3 wasn't providing filter size and accurate size for all SSTable components on disk.
First, filter size is provided by taking advantage that its in-memory representation is roughly the same as on-disk one.
Second, size for all components is provided by piggybacking on sstable parser and writer, so no longer a need to do a separate additional step after Scylla have either parsed or written all components.
Finally, sstable::storage::get_stats() is killed, so the burden is no longer pushed on the storage type implementation.
Refs #13649.
Closes#13682
* github.com:scylladb/scylladb:
test: Verify correctness of sstable::bytes_on_disk()
sstable: Piggyback on sstable parser and writer to provide bytes_on_disk
sstable: restore indentation in read_digest() and read_checksum()
sstable: make all parsing of simple components go through do_read_simple()
sstable: Add missing pragma once to random_access_reader.hh
sstable: make all writing of simple components go through do_write_simple()
test: sstable_utils: reuse set_values()
sstable: Restore indentation in read_simple()
sstable: Coroutinize read_simple()
sstable: Use filter memory footprint in filter_size()
so we can apply `execute_cql()` on `generation_type` directly without
extracting its value using `generation.value()`. this paves the road to
adding UUID based generation id to `generation_type`. as by then, we
will have both UUID based and integer based `generation_type`, so
`generation_type::value()` will not be able to represent its value
anymore. and this method will be replaced by `operator data_value()` in
this use case.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change prepares for the change to use `variant<UUID, int64_t>`
as the value of `generation_type`. as after this change, the "value"
of a generation would be a UUID or an integer, and we don't want to
expose the variant in generation's public interface. so the `value()`
method would be changed or removed by then.
this change takes advantage of the fact that the formatter of
`generation_type` always prints its value. also, it's better to
reuse `generation_type` formatter when appropriate.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
try_prepare_expression(constant) used to throw an error
when trying to prepeare expr::constant.
It would be useful to be able to do this
and it's not hard to implement.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Make it possible to do test_assignment for column_values.
It's implemented using the generic expression assignment
testing function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
test_assignment checks whether a value of some type
can be assigned to a value of different type.
There is no implementation of test_assignment
for expr::constant, but I would like to have one.
Currently there is a custom implementation
of test_assignment for each type of expression,
but generally each of them boils down to checking:
```
type1->is_value_compatible_with(type2)
```
Instead of implementing another type-specific funtion
I added expresion_test_assignment and used it to
implement test_assignment for constant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this change helps to silence the warning of `-Wmissing-braces` from
clang. in general, we can initialize an object without constructor
with braces. this is called aggregate initialization.
but the standard does allow us to initialize each element using
either copy-initialization or direct-initialization. but in our case,
neither of them applies, so the clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change,
also, take the opportunity to use structured binding to simplify the
related code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
use {} instead of {0ul} for zero initialization. as `_query_cnt`
is a multi-dimension array, each elements in `_query_cnt` is yet
another array. so we cannot initialize it with a `{0ul}`. but
to zero-initialize this array, we can just use `{}`, as per
https://en.cppreference.com/w/cpp/language/zero_initialization
> If T is array type, each element is zero-initialized.
so this should recursively zero-initialize all arrays in `_query_cnt`.
this change should silence following warning:
stats.hh:88:60: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
[statements::statement_type::MAX_VALUE + 1] = {0ul};
^~~
{ }
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
@annastuchlik please review
Closes#13691
* github.com:scylladb/scylladb:
adding documentation for integration with MindsDB
adding documentation for integration with MindsDB
this series syncs the CMake building system with `configure.py` which was updated for introducing the tablets feature. also, this series include a couple cleanups.
Closes#13699
* github.com:scylladb/scylladb:
build: cmake: remove dead code
build: move test-perf down to test/perf
build: cmake: pick up tablets related changes
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `db_clock::time_point` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `generation_id` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loading cores from Scylla executables installed in a non-standard
location can cause gdb to fail reading required libraries.
This is an example of a warning I've got after trying to load core
generated by dtest jenkins job (using ./scripts/open-coredump.sh):
> warning: Can't open file /jenkins/workspace/scylla-master/dtest-daily-debug/scylla/.ccm/scylla-repository/0d64f327e1af9bcbb711ee217eda6df16e517c42/libreloc/libboost_system.so.1.78.0 during file-backed mapping note processing
Invocations of `scylla threads` command ended with an error:
> (gdb) scylla threads
> Python Exception <class 'gdb.error'>: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
> Error occurred in Python: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
An easy fix for this is to set solib-search-path to
/opt/scylladb/libreloc/.
This commit adds that set command to suggested command line gdb
arguments. I guess it's a good idea to always suggest setting
solib-search-path to that path, as it can save other people from wasting
their time on looking why does coredump opening does not work.
Closes#13696
the removed CMake script was designed to cater the needs when
Seastar's CMake script is not included in the parent project, but
this part is never tested and is dysfunctional as the `target_source()`
misses the target parameter. we can add it back when it is actually needed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We may catch exceptions that are not `marshal_exception`.
Print std::current_exception() in this case to provide
some context about the marshalling error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13693
Consider
- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2
```
auto host_id = _gossiper.get_host_id(endpoint);
auto existing = tmptr->get_endpoint_for_host_id(host_id);
```
host_id = new host id
existing = empty
As a result, del_replacing_endpoint() will not be called.
This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.
This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
replacing node uses a new host id than the node to be replaced.
To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3
After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3
Tests: https://github.com/scylladb/scylla-dtest/pull/3126Closes#13677
bytes_on_disk is the sum of all sstable components.
As read_simple() fetches the file size before parsing the component,
bytes_on_disk can be added incrementally rather than an additional
step after all components were already parsed.
Likewise, write_simple() tracks the offset for each new component,
and therefore bytes_on_disk can also be added incrementally.
This simplifies s3 life as it no longer have to care about feeding
a bytes_on_disk, which is currently limited to data and index
sizes only.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all parsing of simple components going through do_read_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all writing of simple components going through do_write_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For S3, filter size is currently set to zero, as we want to avoid
"fstat-ing" each file.
On-disk representation of bloom filter is similar to the in-memory
one, therefore let's use memory footprint in filter_size().
User of filter_size() is API implementing "nodetool cfstats" and
it cares about the size of bloom filter data (that's how it's
described).
This way, we provide the filter data size regardless of the
underlying storage type.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty or not. The change was made in order maintain consistency with the
codebase and adding completeness to the empty check. This change has no
impact on other parts of the codebase.
Closes#13656
The new name is more generic and appropriate for topology transitions
which don't affect any specific replica but the entire cluster as a
whole (which we'll introduce later).
Also take `guard` directly instead of `node_to_work_on` in this more
generic function. Since we want `node_to_work_on` to die when we steal
its guard, introduce `take_guard` which takes ownership of the object
and returns the guard.
Previously the code assumed that there was always a 'node to work on' (a
node which wants to change its state) or there was no work to do at all.
It would find such a node, switch on its state (e.g. check if it's
bootstrapping), and in some states switch on the topology
`transition_state` (e.g. check if it's `write_both_read_old`).
We want to introduce transitions that are not node-specific and can work
even when all nodes are 'normal' (so there's no 'node to work on'). As a
first step, we refactor the code so it switches on `transition_state`
first. In some of these states, like `write_both_read_old`, there must
be a 'node to work on' for the state to make sense; but later in some
states it will be optional (such as `commit_cdc_generation`).
The lambdas defined inside the fiber are now methods of this class.
Currently `handle_node_transition` is calling `handle_ring_transition`,
in a later commit we will reverse this: `handle_ring_transition` will
call `handle_node_transition`. We won't have to shuffle the functions
around because they are members of the same class, making the change
easier to review. In general, the code will be easier to maintain in
this new form (no need to deal with so many lambda captures etc.)
Also break up some lines which exceeded the 120 character limit (as per
Seastar coding guidelines).
if the visitor clauses are the same, we can just use the generic version
of it by specifying the parameter with `auto&`. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13626
The new name is more generic - it describes the current step of a
'topology saga` (a sequence of steps used to implement a larger topology
operation such as bootstrap).
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.
The `replication_state` column in `system.topology` is now `static`.
This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of replication strategy, it
just alters the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, compiler generate operator!=() if the corresponding
operator==() is already defined, the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`! operator==()`, this matches the behavior of the default
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the defaulted operator!=, C++20 also brings to us
the defaulted operator==() -- it is able to generated the
operator==() if the member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if the operator==() is also implemented as
a lexicographical comparison of all memeber variables of the
class/struct in question, it is implemented using the default
generated one by removing its body and mark the function as
`default`. moreover, if the class happen to have other comparison
operators which are implemented using lexicographical comparison,
the default generated `operator<=>` is used in place of
the defaulted `operator==`.
sometimes, we fail to mark the operator== with the `const`
specifier, in this change, to fulfil the need of C++ standard,
and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but it is not always the case, in the
class of `version`, we use `version` as the parameter type, to
fulfill the need of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantic of the comparison operator. and is a more idiomatic
way to pass non-trivial struct as function parameters.
please note, because in C++20, both operator= and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric form of the another variant. if they were
not removed, compiler would, for instance, find ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13687
Common compression libraries work on contiguous buffers.
Contiguous buffers are a problem for the allocator. However, as long as they are short-lived,
we can avoid the expensive allocations by reusing buffers across tasks.
This idea is already applied to the compression of CQL frames, but with some deficiencies.
`utils: redesign reusable_buffer` attempts to improve upon it in a few ways. See its commit message for an extended discussion.
Compression buffer reuse also happens in the zstd SSTable compressor, but the implementation is misguided. Every `zstd_processor` instance reuses a buffer, but each instance has its own buffer. This is very bad, because a healthy database might have thousands of concurrent instances (because there is one for each sstable reader). Together, the buffers might require gigabytes of memory, and the reuse actually *increases* memory pressure significantly, instead of reducing it.
`zstd: share buffers between compressor instances` aims to improve that by letting a single buffer be shared across all instances on a shard.
Closes#13324
* github.com:scylladb/scylladb:
zstd: share buffers between compressor instances
utils: redesign reusable_buffer
This commit moves the Glossary page to the Reference
section. In addition, it adds the redirection so that
there are no broken links because of this change
and fixes a link to a subsection of Glossary.
Closes#13664
The zstd implementation of `compressor` has a separate decompression and
compression context per instance. This is unreasonably wasteful. One
decompression buffer and one compression buffer *per shard* is enough.
The waste is significant. There might exist thousands of SSTable readers, each
containing its own instance of `compressor` with several hundred KiB worth of
unneeded buffers. This adds up to gigabytes of wasted memory and gigapascals
of allocator pressure.
This patch modifies the implementation of zstd_processor so that all its
instances on the shard share their contexts.
Fixes#11733
Large contiguous buffers put large pressure on the allocator
and are a common source of reactor stalls. Therefore, Scylla avoids
their use, replacing it with fragmented buffers whenever possible.
However, the use of large contiguous buffers is impossible to avoid
when dealing with some external libraries (i.e. some compression
libraries, like LZ4).
Fortunately, calls to external libraries are synchronous, so we can
minimize the allocator impact by reusing a single buffer between calls.
An implementation of such a reusable buffer has two conflicting goals:
to allocate as rarely as possible, and to waste as little memory as
possible. The bigger the buffer, the more likely that it will be able
to handle future requests without reallocation, but also the memory
memory it ties up.
If request sizes are repetitive, the near-optimal solution is to
simply resize the buffer up to match the biggest seen request,
and never resize down.
However, if we anticipate pathologically large requests, which are
caused by an application/configuration bug and are never repeated
again after they are fixed, we might want to resize down after such
pathological requests stop, so that the memory they took isn't tied
up forever.
The current implementation of reusable buffers handles this by
resizing down to 0 every 100'000 requests.
This patch attempts to solve a few shortcomings of the current
implementation.
1. Resizing to 0 is too aggressive. During regular operation, we will
surely need to resize it back to the previous size again. If something
is allocated in the hole left by the old buffer, this might cause
a stall. We prefer to resize down only after pathological requests.
2. When resizing, the current implementation allocates the new buffer
before freeing the old one. This increases allocator pressure for no
reason.
3. When resizing up, the buffer is resized to exactly the requested
size. That is, if the current size is 1MiB, following requests
of 1MiB+1B and 1MiB+2B will both cause a resize.
It's preferable to limit the set of possible sizes so that every
reset doesn't tend to cause multiple resizes of almost the same size.
The natural set of sizes is powers of 2, because that's what the
underlying buddy allocator uses. No waste is caused by rounding up
the allocation to a power of 2.
4. The interval of 100'000 uses is both too low and too arbitrary.
This is up for discussion, but I think that it's preferable to base
the dynamics of the buffer on time, rather than the number of uses.
It's more predictable to humans.
The implementation proposed in this patch addresses these as follows:
1. Instead of resizing down to 0, we resize to the biggest size
seen in the last period.
As long as at least one maximal (up to a power of 2) "normal" request
appears each period, the buffer will never have to be resized.
2. The capacity of the buffer is always rounded up to the nearest
power of 2.
3. The resize down period is no longer measured in number of requests
but in real time.
Additionally, since a shared buffer in asynchronous code is quite a
footgun, some rudimentary refcounting is added to assert that only
one reference to the buffer exists at a time, and that the buffer isn't
downsized while a reference to it exists.
Fixes#13437
Fixes https://github.com/scylladb/scylladb/issues/13578
Now that the documentation is versioned, we can remove
the .. versionadded:: and .. versionchanged:: information
(especially that the latter is hard to maintain and now
outdated), as well as the outdated information about
experimental features in very old releases.
This commit removes that information and nothing else.
Closes#13680
std::rel_ops was deprecated in C++20, as C++20 provides a better solution for defining comparison operators. and all the use cases previously to be addressed by `using namespace std::rel_ops` have been addressed either by `operator<=>` or the default-generated `operator!=`.
so, in this series, to avoid using deprecated facilities, let's drop all these `using namespace std::rel_ops`. there are many more cases where we could either use `operator<=>` or the default-generated `operator!=` to simplify the implementation. but here, we care more about `std::rel_ops`, we will drop the most (if not all of them) of the explicitly defined `operator!=` and other comparison operators later.
Closes#13676
* github.com:scylladb/scylladb:
treewide: do not use std::rel_ops
dht: token: s/tri_compare/operator<=>/
Currently, the `reset_to()` implementation calls `consume(new_amount)` (if
not zero), then calls `signal(old_amount)`. This means that even if
`reset_to()` is a net reduction in the amount of resources, there is a
call to `consume()` which can now potentially throw.
Add a special case for when the new amount of resources is strictly
smaller than the old amount. In this case, just call `signal()` with the
difference. This not just avoids a potential `std::bad_alloc`, but also
helps relieving memory pressure when this is most needed, by not failing
calls to release memory.
Into reset_to() and reset_to_zero(). The latter replaces `reset()` with
the default 0 resources argument, which was often called from noexcept
contexts. Splitting it out from `reset()` allows for a specialized
implementation that is guaranteed to be `noexcept` indeed and thus
peace of mind.
This API is dangerous, all resource consumption should happen via RAII
objects that guarantee that all consumed resources are appropriately
released.
At this poit, said API is just a low-level building block for
higher-level, RAII objects. To ensure nobody thinks of using it for
other purposes, make it private and make external users friends instead.
Fix two issues with the replace operation introduced by recent PRs.
Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).
Fixes: #13651Closes#13655
* github.com:scylladb/scylladb:
test: new suite for testing raft-based topology
test: remove topology_custom/test_custom.py
raft topology: don't require new CDC generation UUID to always be present
raft topology: include shard_count/ignore_msb during replace
std::rel_ops was deprecated in C++20, as C++20 provides a better
solution for defining comparison operators. and all the use cases
previously to be addressed by `using namespace std::rel_ops` have
been addressed either by `operator<=>` or the default-generated
`operator!=`.
so, in this change, to avoid using deprecated facilities, let's
drop all these `using namespace std::rel_ops`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that C++20 is able to generate the default-generated comparing
operators for us. there is no need to define them manually. and,
`std::rel_ops::*` are deprecated in C++20.
also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `range_tombstone_list` and `range_tombstone_entry`
without the help of `operator<<`.
the corresponding `operator<<()` for `range_tombstone_entry` is moved
into test, where it is used. and the other one is dropped in this change,
as all its callers are now using fmtlib for formatting now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13627
there are two variants of `query_processor::for_each_cql_result()`,
both of them perform the pagination of results returned by a CQL
statement. the one which accepts a function returning an instant
value is not used now. so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13675
Now gossiper doesn't need those two as its dependencies, they can be
removed making code shorter and dependencies graph simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
That's, in fact, an independent change, because feature enabler doesn't
need this method. So this patch is like "while at it" thing, but on the
other hand it ditches one more qctx usage.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers now have the system keyspace instance at hand.
Unfortunately, this de-static doesn't allow more qctx drops, because
both methods use set_|get_scylla_local_param helpers that do use qctx
and are still in use by other static methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code belongs to feature service, system keyspace shoulnd't be aware
of any pecularities of startup features enabling, only loading and
saving the feature lists.
For now the move happens only in terms of code declarations, the
implementation is kept in its old place to reduce the patch churn.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is only called by the enabler and is short enough
to be merged into the caller. This kills two birds with one stone --
makes less walks over features list and will make it possible to
de-static system keyspace features load and save methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just move the code. Thie makes gossiper not mess
with enabling/persisting features, but just gossiping them around.
Feature intersection code is still in gossiper, but can be moved in
more suitable location any time later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays features are persisted in feature_service::enable() and there
are four callers of it
- feature enabler via gossiper notifications
- boot kicks feature enabler too
- schema loader tool
- cql test env
None but the first case need to persist features. The loader tool in
fact doesn't do it even now it by keeping qctx uninitialized. Cql test
env wires up the qctx, but it makes no differences for the test cases
themselves if the features are persisted or not.
Boot-time is a bit trickier -- it loads the feature list from system
keyspace and may filter-out some of them, then enable. In that case
committing the list back into system keyspace makes no sense, as the
persisted list doesn't extend.
The enabler, in turn, can call system keyspace directly via its explicit
dependency reference. This fixes the inverse dependency between system
keyspace and feature service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now knows that it runs inside async context, but things are changing
and soon it will be moved out of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the enabler that's responsible for enabling the features and,
implicitly, persisting them into the system keyspace. This patch moves
this logic from gossiper to feature_enabler, further patching will make
the persisting code be explicit.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a bit hairy. The maybe_enable_features() is called from two places
-- the feature_enabler upon notifications from gossiper and directory by
gossiper from wait_for_gossip_to_settle().
The _latter_ is called only when the wait_for_gossip_to_settle() is
called for the first time because of the _gossip_settled checks in it.
For the first time this method is called by storage_service when it
tries to join the ring (next it's called from main, but that's not of
interest here).
Next, despite feature_enabler is registered early -- when gossiper
instance is constructed by sharded<gossiper>::start() -- it checks for
the _gossip_settled to be true to take any actions.
Considering both -- calling maybe_enable_features() _and_ registering
enabler after storage_service's call to wait_for_gossip_to_settle()
doesn't break the code logic, but make further patching possible. In
particular, the feature_enabler will move to feature_service not to
pollute gossiper code with anything that's not gossiping.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And rename the guy. These dependencies will be used further, both are
available and started when the enabler is constructed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.
Fixes#13657Closes#13658
* github.com:scylladb/scylladb:
cql3: fix printing of column_specification::name in some error messages
cql3: fix printing of column_definition::name in some error messages
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `storage_service::mode` without the help of `operator<<`.
the corresponding `operator<<()` for `storage_service::mode` is removed
in this change, as all its callers are now using fmtlib for formatting
now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13640
Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c992f. It's still useful
to keep the test to avoid regresions.
The bug manifests by reader throwing:
std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
The reason is that prior to the rework of the cache reader,
range_tombstone_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.
Closes#13665
raft_group0 does not really depends on migration_manager, it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
raft_group0 does not really depends on query_processor, it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
Introduce new test suite for testing the new topology coordinator
(runs under `raft` experimental flag). Add a simple test that performs a
basic sequence of topology operations.
raft_group0 does not really depends on storage_service, it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
Consolidate `bytes_view_hasher` and abstract_replication_strategy `factory_key_hasher` which are the same into a reusable utils::basic_xx_hasher.
To be used in a followup series for netw:msg_addr.
Closes#13530
* github.com:scylladb/scylladb:
utils: hashing: use simple_xx_hasher
utils: hashing: add simple_xx_hasher
utils: hashers: add HasherReturning concept
hashing: move static_assert to source file
Add a test for get/enable/disable auto_compaction via to column_family api.
And add log messages for admin operations over that api.
Closes#13566
* github.com:scylladb/scylladb:
api: column_family: add log messages for admin operation
test: rest_api: add test_column_family
After the addition of the rust-std-static-wasm32-wasi target, we're
able to compile the Rust programs to Wasm binaries. However, we're still
only able to handle the Wasm UDFs in the Text format, so we need a tool
to translate the .wasm files to .wat. Additionally, the .wasm files
generated by default are unnecessarily large, which can be helped
using wasm-opt and wasm-strip.
The tool for translating wasm to wat (wasm2wat), and the tool for
stripping the wasm binaries (wasm-strip) are included in the `wabt`
package, and the optimization tool (wasm-opt) is included in the
`binaryen` package. Both packages are added to install-dependencies.sh
Closes#13282
[avi: regenerate frozen toolchain]
Closes#13605
The s3::readable_file::stat() call returns a hand-crafted stat structure
with some fields set to some sane values, most are constants. However,
other fields remain not initialized which leads to troubles sometimes.
Better to fill the stat with zeroes and later revisit it for more sane
values.
fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13650
the temporary directory holding the log file collecting the scylla
subprocess's output is specified by the test itself, and it is
`test_tempdir`. but unfortunately, cql-pytest/run.py is not aware
of this. so `cleanup_all()` is not able to print out the logging
messages at exit. as, please note, cql-pytest/run.py always
collect "log" file under the directory created using `pid_to_dir()`
where pid is the spawned subprocesses. but `object_store/run` uses
the main process's pid for its reusable tempdir.
so, with this change, we also register a cleanup func to printout
the logging message when the test exits.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For concurrent schema changes test, log when the different stages of the
test are finished.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13654
this change ensures that `dk._key` is formatted with the "pk" prefix.
as in 3738fcb, the `operator<<` for partition_key was removed. so the
compiler has to find an alternative when trying to fulfill the needs
when this operator<< is called. fortunately, from the compiler's
perspective, `partition_key` has an `operator managed_bytes_view`, and
this operator does not have the explicit specifier, and,
`managed_bytes_view` does support `operator<<`. so this ends up with a
change in the format of `decorated_key` when it is printed using
`operator<<`. the code compiles. but unfortunately, the behavior is
changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the
partition_key is supposed to be printed like "pk{010203}" instead of
"010203". the latter is how `managed_bytes_view` is formatted.
a test is added accordingly to avoid future changes which break the
dtest.
Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13653
column_specification::name is a shared pointer, so it should be
dereferenced before printing - because we want to print the name, not
the pointer.
Fix a few instances of this mistake in prepare_expr.cc. Other instances
were already correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Printing a column_definition::name() in an error message is wrong,
because it is "bytes" and printed as hexadecimal ASCII codes :-(
Some error messages in cql3/operation.cc incorrectly used name()
and should be changed to name_as_text(), as was correctly done in
a few other error messages in the same file.
This patch also fixes a few places in the test/cql approval tests which
"enshrined" the wrong behavior - printing things like 666c697374696e74
in error messages - and now needs to be fixed for the right behavior.
Fixes#13657
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
During node replace we don't introduce a new CDC generation, only during
regular bootstrap. Instead of checking that `new_cdc_generation_uuid`
must be present whenever there's a topology transition, only check it
when we're in `commit_cdc_generation` state.
Use simple_xx_hasher for bytes_view and effective_replication_map::factory_key
appending hashers instead of their custom, yet identical implementations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And a more specific HasherReturningBytes for hashers
that return bytes in finalize().
HasherReturning will be used by the following patch
also for simple hashers that return size_t from
finalize().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, responsible for injecting mutations of system.tablets to
schema changes.
Note that not all migrations are handled currently. Dependant view or
cdc table drops are not handled.
tablet_aware_replication_strategy is a trait class meant to be
inherited by replication strategy which want to work with tablets. The
trait produces per-table effective_replication_map which looks at
tablet metadata to determine replicas.
No replication startegy is changed to use tablets yet in this patch.
This change puts the reloading into topology_state_load(), which is a
function which reloads token_metadata from system.topology (the new
raft-based topology management). It clears the metadata, so needs to
reload tablet map too. In the future, tablet metadata could change as
part of topology transaction too, so we reload rather than preserve.
It is already set by schema_maker. In tablets_test we will depend on
the id being the same as that set in the schema_builder, so don't
change it to something else.
Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.
This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
This reverts commit 95bf8eebe0.
Later patches will adapt this code to work with token_range_splitter,
and the unit test added by the reverted commit will start to fail.
The unit test asks the query_ranges_to_vnodes_generator to split the range:
[t:end, t+1:start)
around token t, and expects the generator to produce an empty range
[t:end, t:end]
After adapting this code to token_range_splitter, the input range will
not be split because it is recognized as adjacent to t:end, and the
optimization logic will not kick in. Rather than adding more logic to
handle this case, I think it's better to drop the optimization, as it
is not very useful (rarely happens) and not required for correctness.
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. to work (not fail) in the presence of keyspaces with per-table
erms, they will just not be replicated using those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
With tablet-based replication strategies it will represent replication
of a single table.
Current vnode_effective_replication_map can be adapted to this interface.
This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
Add a formatter to compaction::table_state that
prints the table ks_name.cf_name and compaction group id.
Fixes#13467
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
Fixes#13466
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now, with f1bbf705f9
(Cleanup sstables in resharding and other compaction types),
we may filter sstables as part of resharding compaction
and the assertion that all tokens are owned by the current
shard when filtering is no longer true.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When distributing the resharding jobs, prefer one of
the sstable shard owners based on foreign_sstable_open_info.
This is particularly important for uploaded sstables
that are resharded since they require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Seen after f1bbf705f9 in debug mode
distributed_loader collect_all_shared_sstables copies
compaction::owned_ranges_ptr (lw_shared_ptr<const
dht::token_range_vector>)
across shards.
Since update_sstable_cleanup_state is synchronous, it can
be passed a const refrence to the token_range_vector instead.
It is ok to access the memory read-only across shards
and since this happens on start-up, there are no special
performance requirements.
Fixes#13631
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Update comments, test names and etc. that are still using the old terminology for
permit state names, bring them up to date with the recent state name changes.
Similar to the storage_service api, print a log message
for admin operations like enabling/disabling auto_compaction,
running major compaction, and setting the table compaction
strategy.
Note that there is overlap in functionality
between the storage_service and the column_family api entry points.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, the line is 249 chars long, so split it into
multiple lines for better readabitlity.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore.
in this change,
* alternator_timeout_in_ms is marked as live-updateable
* executor::_s_default_timeout is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now a updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so executor::_timeout_in_ms can
be initialized on per-shard basis
* executor::set_default_timeout() is dropped, as we already pass
the option to executor in its ctor.
please note, in the ctor of executor, we always update the cached
value of `s_default_timeout` with the value of `_timeout_in_ms`,
and we set the default timeout to 10s in `alternator_test_env`.
this is a design decision to avoid bending the production code for
testing, as in production, we always set the timeout with the value
specified either by the default value of yaml conf file.
Fixes#12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, we only tested whether permissions with UDFs
that have quoted names work correctly. This patch adds
the missing test that confirms that we can also use UDTs
(as UDF parameter types) when altering permissions.
Currently, the ut_name::to_string() is used only in 2 cases:
the first one is in logs or as part of error messages, and the
second one is during parsing, temporarily storing the user
defined type name in the auth::resource for later preparation
with database and data_dictionary context.
This patch changes the string so that the 'name' part of the
ut_name (as opposed to the 'keyspace' part) is now quoted when
needed. This does not worsen the logging set of cases, but it
does help with parsing of the resulting string, when finishing
preparing the auth::resource.
After the modification, a more fitting name for the function
is "ut_name::to_cql_string()", so the function is renamed to that.
While in debug mode, we may switch the default stack to
a larger one when parsing cql. We may, however, invoke
the parser recusively, causing us to switch to the big
stack while currently using it. After the reset, we
assume that the stack is empty, so after switching to
the same stack, we write over its previous contents.
This is fixed by checking if we're already using the large
stack, which is achieved by comparing the address of
a local variable to the start and end of the large stack.
clogger.error("[{} compaction {}.{}] Invalid partition start for partition {} ({}), previous partition {} ({}) didn't end with a partition-end fragment{}{}",
throwexceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}",receiver.name,field,field_spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}",*receiver.name,field,field_spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}",receiver.name,i,spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}",*receiver.name,i,spec->type->as_cql3_type()));
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.