Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after a reboot.
This is incorrect behavior, so we now start it during setup.
Also, unmask is unnecessary for enabling the timer.
Fixes #14249
Closes #14252
(cherry picked from commit c70a9cbffe)
Closes #14422
Consider
- 10 repair instances take all 10 units of _streaming_concurrency_sem
- the repair readers are done, but the permits are not released, since they
are waiting for the view update _registration_sem
- view updates try to take _streaming_concurrency_sem to make
progress (so that _registration_sem could be released), but they
cannot take _streaming_concurrency_sem, since the 10 repair
instances hold all the units
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes #14676
Closes #14677
(cherry picked from commit 1b577e0414)
Fixes https://github.com/scylladb/scylladb/issues/14598
This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)
This commit must be backported to branch-5.3, because
the feature is already on that branch.
Closes #14686
(cherry picked from commit 9db9dedb41)
Adds preemption points used in Alternator when:
- sending a bigger JSON response
- building results for BatchGetItem
I've tested manually by inserting, in preemptible sections (e.g. before `os.write`), code similar to:
```
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
```
and observing reactor stall times. After the patch they
did not increase, while before they kept building up due to the lack of preemption.
Refs #7926
Fixes #13689
Closes #12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
(cherry picked from commit d2e089777b)
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
Fixes: #13841
Fixes: #12552
Closes #13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
(cherry picked from commit a7c2c9f92b)
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probability, which blocks other work.
Fixes #14395
Closes #14662
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one of the upserts, as all cells in the respective upsert carry the same expiration time.
Fixes scylladb/scylladb#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
Closes scylladb/scylladb#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
(cherry picked from commit 87b4606cd6)
Closes #14647
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Backported from c25201c1a3. view_build_test.cc needed some tiny adjustments for the backport.
Closes #14622
Fixes #14503
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in an improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes #14503
Fixes #11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
call get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to the creating scheduling group and using it, not
the active one, when/if creating new metrics.
Closes #14294
(cherry picked from commit f18e967939)
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There have to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes was modified not to perform recursion at all. Most of the code
of operator()() was moved to maybe_produce_batch, which does not
recur if it is not possible for it to produce a fragment; instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes #14452
(cherry picked from commit ee9bfb583c)
Closes #14606
In main.cc, we have early commands which want to run prior to initializing
Seastar.
Currently, perf_fast_forward breaks this, since it defines
"app_template app" as a global variable.
To avoid that, we defer running app_template's constructor to
scylla_fast_forward_main().
Fixes #13945
Closes #14026
(cherry picked from commit 45ef09218e)
There was a bug that caused aggregates to fail when
used on case-sensitive column names.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes #14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
(cherry picked from commit 8a7261fd70)
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare it
to the smallest heat metrics of replica sets calculated separately for each
vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was caught by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
Fixes #14284
Closes #14300
(cherry picked from commit 732feca115)
Backport note: the original `assert` was never added to branch-5.3, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
Another node can stop after it has joined group 0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some node which can do
the confirmation.
This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.
Fixes#13543
(cherry picked from commit a45e0765e4)
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.
This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
Closes #14373
(cherry picked from commit f4ae2c095b)
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes #13491
Closes #14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
(cherry picked from commit 586102b42e)
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036
This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.
This commit fixes a bug and must be backported to
branch-5.3 and branch-5.2.
Closes #14372
(cherry picked from commit 74fc69c825)
Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required as there is now a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating a bad user experience.
(cherry picked from commit f60f89df17)
Closes #14371
Fixes https://github.com/scylladb/scylladb/issues/14084
This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.
Closes #14228
* github.com:scylladb/scylladb:
doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
doc: add OS support for ScyllaDB 5.3
(cherry picked from commit aaac455ebe)
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)
Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.
The fix is to not insert shared sstables into the cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor should they be, because
resharding itself cleaned up its input files).
Dtest: https://github.com/scylladb/scylla-dtest/pull/3206
Fixes #14001.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14147
(cherry picked from commit 156d771101)
Fixes https://github.com/scylladb/scylladb/issues/14097
This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.
The update is in sync with the change made for
ScyllaDB 5.2.
This commit must be backported to branch-5.2 and
branch-5.3.
Closes #14118
(cherry picked from commit b7022cd74e)
This change is about closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but it is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784
Closes #13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of the keyspace_metadata
constructors was using an empty user_types_metadata instead of the
passed one. Fix that.
Fixes #14139
Closes #14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes #13545.
Closes #14134
(cherry picked from commit f51312e580)
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes #14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14038
(cherry picked from commit 23443e0574)
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.
Fixes #9559
Closes #13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13514
this change is one of a series which drops most of the callers
using the SSTable generation as an integer. as the generation of an SSTable
is but an identifier, we should not use it as an integer outside of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of the integer.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13931
Without the feature, the system schema doesn't have the table, and the
read will fail with:
Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)
We should not attempt to read tablet metadata if the experimental
feature is not enabled.
Fixes #13946
Closes #13947
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.
Once the g.c. code is moved, it also allows cleaning the sstable class's list of static helpers a bit.
refs: #13024
refs: #13020
refs: #12707
Closes #13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbace collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.
`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.
`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.
The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.
It's cumbersome and people forget about it,
so let's change the default to `user`.
There were issues about expressions being printed
in a `strange` way; this fixes them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13916
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes #13933
This implicit link is pretty bad, because the feature service is a low-level
one which lots of other services depend on. System keyspace is the opposite
-- a high-level one that needs e.g. the query processor and database to
operate. This inverse dependency is created by the feature service's need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place where enabled features are
committed is when the gossiper enables features on join or upon receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Although the gossiper already has a system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with the needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837
Closes #13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
Some state that is used to fill in the 'peers' table is still propagated
over gossiper. When moving a node into the normal state in the raft
topology code, use the data from the gossiper to populate the peers table, because
storage_service::on_change() will not do it in case the node was not in
the normal state at the time it was called.
Fixes: #13911
Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone functions in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalents in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which has become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870
Closes #13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
Commit 8c4b5e4283 introduced an optimization which only
calculates the max purgeable timestamp when a tombstone satisfies the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tends to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption that retrieval
of gc_before is more expensive than calculating max purgeable
is kept. We can revisit it later. But in the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13908
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13853
There are several places that need to carry a pointer to a table that's shard-wide accessible -- database snapshot and truncate code and distributed loader. The database code uses `get_table_on_all_shards()` returning a vector of foreign lw-pointers, the loader code uses its own global_column_family_ptr class.
This PR generalizes both into global_table_ptr facility.
Closes#13909
* github.com:scylladb/scylladb:
replica: Use global_table_ptr in distributed loader
replica: Make global_table_ptr a class
replica: Add type alias for vector of foreign lw-pointers
replica: Put get_table_on_all_shards() to header
replica: Rewrite get_table_on_all_shards()
instead of encoding the fact that we are using generation identifier
as a hint where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. this helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13846
`ep` is std::move'ed to get_endpoint_state_for_endpoint_ptr
but it's used later for logger.warn()
Fixes#13921
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13920
The loader has very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now all users of global_table know it's a vector and reference its
elements with this_shard_id() index. Making the global_table_ptr a class
makes it possible to stop using operator[] and "index" this_shard_id()
in its -> and * operators.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use sharded<database>::invoke_on_all() instead of an open-coded equivalent.
Also don't access database's _column_families directly, use the
find_column_family() method instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tempdir for filesystem-based sstables is the {generation}.sstable one.
There are two places that need to know the ".sstable" extension -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself with
the path being pending-delete dir for sure. So the method can be made
private and the is_pending_delete_dir() can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When made non-static, the method can use the sstable_directory::_sstable_dir
path instead of a provided argument. The main benefit is that the method can later
be moved onto lister so that filesystem and ownership-table listers can
process dangling bits differently.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check if the deletion log exists in the handling helper, not outside of
it. This makes next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.
First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert the sstring dir into an fs::path dir and switch the logger.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the loader starts it first scans the directory for sstables'
tempdirs and pending deletion logs. Put both into one call so that it
can be moved more easily later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes#13917
perform_cleanup may be waiting for those sstables
to become eligible for cleanup so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Time out if there is no progress within 5 minutes
to prevent hanging due to a view-building bug.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is possible that cleanup will be executed
right after repair-based node operations,
in which case we have a 5-minute timer
before off-strategy compaction is started.
After marking the sstables that need cleanup,
perform offstrategy compaction, if needed.
This will implicitly cleanup those sstables
as part of offstrategy compaction, before
they are even passed for view update (if the table
has views/secondary index).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scan all sstables to determine which of them
requires cleanup before calling perform_task_on_all_files.
This allows for cheaper no-op return when
no sstable was identified as requiring cleanup,
and also it will allow triggering offstrategy
compaction if needed, after selecting the sstables
for cleanup, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently callers of `for_each_sstable` need to
use a seastar thread to allow preemption
in the for_each_sstable loop.
Provide for_each_sstable_gently and
for_each_sstable_gently_until to make using this
facility from a coroutine easier, without requiring
a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
string_format_test was added in 1b5d5205c8,
so let's add it to CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13912
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13907
Currently we may start to receive requests before group0 is configured
during boot. If that happens those requests may try to pull schema and
issue raft read barrier which will crash the system because group0 is
not yet available. Work around it by pretending raft is disabled in
this case and using the non-raft procedure. The proper fix should make sure
that storage proxy verbs are registered only after group0 is fully
functional.
Message-Id: <ZGOZkXC/MsiWtNGu@scylladb.com>
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes#12462
Closes#13906
CI once failed due to mc being unable to configure the minio server. There's currently no clue why it could happen; let's increase the minio.py verbosity a bit
refs: #13896
Closes#13901
* github.com:scylladb/scylladb:
test,minio: Run mc with --debug option
test,minio: Log mc operations to log file
Currently, when a user creates a function or a keyspace, no
permissions on functions are updated.
Instead, the user should gain all permissions on the function
that they created, or on all functions in the keyspace they have
created. This is also the behavior in Cassandra.
However, if the user is granted permissions on a function after
performing a CREATE OR REPLACE statement, they may
actually only alter the function but still gain permissions to it
as a result of the approach above, which requires another
workaround added to this series.
Lastly, as of right now, when a user is altering a function, they
need both CREATE and ALTER permissions, which is incompatible
with Cassandra - instead, only the ALTER permission should be
required.
This series fixes the mentioned issues, and the tests are already
present in the auth_roles_test dtest.
Fixes#13747
Closes#13814
* github.com:scylladb/scylladb:
cql: adjust tests to the updated permissions on functions
cql: fix authorization when altering a function
cql: grant permissions on functions when creating a keyspace/function
cql: pass a reference to query processor in grant_permissions_to_creator
test_permissions: make tests pass on cassandra
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.
get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating directly over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`), so as to prevent use-after-free that may potentially happen if the map is rehashed while the function yields, invalidating the loop iterators.
Fixes#13899
Closes#13900
* github.com:scylladb/scylladb:
storage_service: do not preempt while traversing endpoint_state_map
gossiper: do not preempt while traversing endpoint_state_map
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since [gms: heart_beat_state: use generation_type and version_type](4cdad8bc8b)
(merged in [Merge 'gms: define and use generation and version types'...](7f04d8231d))
Implement min/max correctly.
Fixes#13801
Closes#13880
* github.com:scylladb/scylladb:
storage_service: handle_state_normal: on_internal_error on "owns no tokens"
utils: tagged_integer: implement std::numeric_limits::{min,max}
test: add tagged_integer_test
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.
We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.
Fixes#13807
Closes#13804
The auto& db = proxy.local().get_db() is called a few lines above this
patch, so the &db can be reused for invoke_on_all() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13896
Currently everything minio.py does goes to test.py log, while mc (and
minio) output go to another log file. That's inconvenient, better to
keep minio.py's messages in minio log file.
Also, while at it, print a message if the local alias drop fails (it's
a benign failure, but it's good to have the note anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this option allows the user to use a specified linker instead of the
default one. this is more flexible than adding more linker
candidates to the known linkers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13874
For unknown reasons, clang 16 rejects equality comparison
(operator==) where the left-hand-side is an std::string and the
right-hand-side is an sstring. gcc and older clang versions first
convert the left-hand-side to an sstring and then call the symmetric
equality operator.
I was able to hack sstring to support this asymmetric comparison,
but the solution is quite convoluted, and it may be that it's clang
at fault here. So instead this patch eliminates the three cases where
it happened. With this applied, we can build with clang 16.
Closes#13893
in this change, the type of the "generation" field of "sstable" in the
return value of RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".
this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.
this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData. and "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by the nodetool, which prints out returned SSTable info list with
a pretty formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
the nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, neither does it enforce the type -- it just
tries its best to pretty-print them as a table.
But the fields in CompositeData are typed: when the scylla JMX exporter
translates the returned SSTables from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So, we should be consistent on the type of "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
the dependencies on exactly same version when shipping deb and rpm
packages, we should be safe when it comes to interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13834
Instead of simply throwing an exception. With just the exception, it is
impossible to find out what went wrong, as this API is very generic and
is used in a variety of places. The backtrace printed by
`on_internal_error()` will help zero in on the problem.
Fixes: #13876
Closes#13883
Although this condition should not happen,
we suspect that certain timing conditions might
lead to this state where a node in handle_state_normal
(possibly when shutting down) has no tokens.
Currently we call on_internal_error_noexcept, so
if abort_on_internal_error is false, we will just
print an error and continue on with handle_state_normal.
Change that to `on_internal_error` so as to throw an
exception in production in this unexpected state.
Refs #13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13857
This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.
After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1
Closes: #13858
Separate cluster_size into a cluster section and specify this value as
initial_size.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13440
Add a respective unit test.
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since 4cdad8bc8b
(merged in 7f04d8231d)
Implement min/max correctly.
Fixes#13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore. but many users would like to set the driver timers based on
server timers. we need to enable them to configure timeout even
when the server is still running.
in this change,
* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now an updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so `executor::_timeout_in_ms` can
be initialized on per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
the option to executor in its ctor.
Fixes#12232
Closes#13300
* github.com:scylladb/scylladb:
alternator: split the param list of executor ctor into multi lines
alternator,config: make alternator_timeout_in_ms live-updateable
in this series, instead of hardwiring to integer, we switch to a generation generator for creating new generations. this should help us migrate to a generation identifier which can also be represented by a UUID, and potentially help improve the testing coverage once we switch over to a UUID-based generation identifier. we will need to parameterize these tests by then, for sure.
Closes#13863
* github.com:scylladb/scylladb:
test: sstable: use generator to generate generations
test: sstable: pass generation_type in helper functions
test: sstable: use generator to generate generations
It is useful to know which node has the error. For example, when a node
has a corrupted sstable, with this patch, the repair master node can
tell which node has the corrupted sstable.
```
WARN 2023-05-15 10:54:50,213 [shard 0] repair -
repair[2df49b2c-219d-411d-87c6-2eae7073ba61]: get_combined_row_hash: got
error from node=127.0.0.2, keyspace=ks2a, table=tb,
range=(8992118519279586742,9031388867920791714],
error=seastar::rpc::remote_verb_error (some error)
```
Fixes#13881
Closes#13882
Currently, error injections can be enabled either through HTTP or CQL.
While these mechanisms are effective for injecting errors after a node
has already started, they can't be reliably used to trigger failures
shortly after node start. In order to support this use case, this commit
adds possibility to enable some error injections via config.
A configuration option `error_injections_at_startup` is added. This
option uses our existing configuration framework, so it is possible to
supply it either via CLI or in the YAML configuration file.
- When passed in commandline, the option is parsed as a
semicolon-separated list of error injection names that should be
enabled. Those error injections are enabled in non-oneshot mode.
The CLI option is marked as not used in release mode and does not
appear in the option list.
Example:
--error-injections-at-startup failure_point1;failure_point2
- When provided in YAML config, the option is parsed as a list of items.
Each item is either a string or a map of parameters. This method is
more flexible as it allows providing parameters for each injection
point. At this time, the only benefit is that it allows enabling
points in oneshot mode, but more parameters can be added in the future
if needed.
Explanatory example:
error_injections_at_startup:
- failure_point1 # enabled in non-oneshot mode
- name: failure_point2 # enabled in oneshot mode
one_shot: true # due to one_shot optional parameter
The primary goal of this feature is to facilitate testing of raft-based
cluster features. An error injection will be used to enable an
additional feature to simulate node upgrade.
Tests: manual
Closes#13861
Constructors of the trace_state class initialize most of the fields in the constructor body with the help of a non-inline helper method. It's possible, and better, to initialize as much as possible with initializer lists.
Closes#13871
* github.com:scylladb/scylladb:
tracing: List-initialize trace_state::_records
tracing: List-initialize trace_state::_props
tracing: List-initialize trace_state::_slow_query_threshold
tracing: Reorder trace_state fields initialization
tracing: Remove init_session_records()
tracing: List-initialize one_session_records::ttl
tracing: List-initialize one_session_records
tracing: List-initialize session_record
There are two layers of sstables deletion -- delete-atomically and wipe. The former is in fact the "API" method; it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner, so that a subsequent boot would pick up the dangling parts and proceed. The latter is a low-level removal function that can also fail in the middle, but handling that is not its concern.
Currently the atomic deletion is implemented with the help of the sstable_directory::delete_atomically() method that commits sstable file names into a deletion log, then calls wipe (indirectly), then drops the deletion log. On boot all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though a deletion log is overkill for S3; it's better implemented with the help of the ownership table. In fact, S3 storage already implements atomic deletion in its wipe method, thus being overly careful.
So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic
fixes: #13016
note: Replaying sstables deletion from ownership table on boot is not here, see #13024
Closes#13562
* github.com:scylladb/scylladb:
sstables: Implement atomic deleter for s3 storage
sstables: Get atomic deleter from underlying storage
sstables: Move delete_atomically to manager and rename
Add a basic test for tagged-integer arithmetic operations.
Remove const qualifier from `tagged_integer::operator[+-]=`
as these are add/sub-assign operators that need to modify
the value in place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similarly to how we handle Roles and Tables, we do not
allow permissions on non-existent objects, so the CREATE
permission on a specific function is meaningless, because
for the permission to be granted to someone, the function
must be already created.
This patch removes the CREATE permission from the set of
permissions applicable to a specific function.
Fixes#13822
Closes#13824
This is a translation of Cassandra's CQL unit test source file
validation/entities/UFTypesTest.java into our cql-pytest framework.
There are 7 tests, which reproduce one known bug:
Refs #13746: UDF can only be used in SELECT, and abort when used in WHERE, or in INSERT/UPDATE/DELETE commands
And uncovered two previously unknown bugs:
Refs #13855: UDF with a non-frozen collection parameter cannot be called on a frozen value
Refs #13860: A non-frozen collection returned by a UDF cannot be used as a frozen one
Additionally, we encountered an issue that can be treated as either a bug or a hole in documentation:
Refs #13866: Argument and return types in UDFs can be frozen
Closes#13867
Adding new APIs /column_family/tombstone_gc and /storage_service/tombstone_gc, that will allow for disabling tombstone garbage collection (GC) in compaction.
Mimics existing APIs /column_family/autocompaction and /storage_service/autocompaction.
column_family variant must specify a single table only, following existing convention.
whereas the storage_service one can specify an entire keyspace, or a subset of the tables in a keyspace.
column_family API usage
-----
```
The table name must be in keyspace:name format
Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
```
storage_service API usage
-----
```
Tables can be specified using a comma-separated list.
Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Enable GC on a subset of tables
curl -s -X POST
"http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"
```
Closes#13793
* github.com:scylladb/scylladb:
test: Test new API for disabling tombstone GC
test: rest_api: extract common testing code into generic functions
Add API to disable tombstone GC in compaction
api: storage_service: restore indentation
api: storage_service: extract code to set attribute for a set of tables
tests: Test new option for disabling tombstone GC in compaction
compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
table: Allow tombstone GC in compaction to be disabled on user request
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This series changes the code to issue raft read
barrier before the pull which will guarantee that the keyspace is created
before the actual schema pull is performed.
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.
The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.
That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth are timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.
Fixes#13660.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13851
This PR contains some small improvements to the safety of consuming/releasing resources to/from the semaphore:
* reader_permit: make the low-level `consume()/signal()` API private, making the only user (an RAII class) friend.
* reader_resources: split `reset()` into `noexcept` and potentially throwing variant.
* reader_resources::reset_to(): try harder to avoid calling `consume()` (when the new resource amount is smaller than the previous one)
Closes#13678
* github.com:scylladb/scylladb:
reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
reader_permit: split resource_units::reset()
reader_permit: make consume()/signal() API private
It consists of two parts -- a call to do_read_simple() with a lambda, and handling of its results. The PR coroutinizes it in two steps for review simplicity -- first the lambda, then the outer caller. Then it restores indentation.
Closes#13862
* github.com:scylladb/scylladb:
sstables: Restore indentation after previous patches
sstables: Coroutinize read_toc() outer part
sstables: Coroutinize read_toc() inner part
Currently s3::client is created for each sstable::storage. It's later shared between the sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle, making a new standalone client. Coupled with seastar's http client spawning connections on demand, this makes it impossible to control the number of open connections to the object storage server.
In order to put some policy on top of that (as well as apply workload prioritization), s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint on each shard and then share it between all the sstables, files and sinks.
There's one difficulty however, solving which is most of what this PR does. The file handle that's used to transfer an sstable's file across shards must carry everything it needs to re-create a file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager, and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owned by the database independently on each shard.
1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep sharded peering service referenced by the sstables_manager that's initialized in main and passed though the database constructor down to sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow
This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager gets the object_storage_config updating facilities from the database (it's overloaded even without it already). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections it also needs a patched seastar http::client; there's a PR already doing that, and once (if) merged, one more fix will come on top.
refs: #13458
refs: #13369
refs: scylladb/seastar#1652
Closes#13859
* github.com:scylladb/scylladb:
s3: Pick client from manager via handle
s3: Generalize s3 file handle
s3: Live-update clients' configs
sstables: Keep clients shared across sstables
storage_manager: Rewrap config map
sstables, database: Move object storage config maintenance onto storage_manager
sstables: Introduce sharded<storage_manager>
The existing storage::wipe() method of s3 is in fact atomic deleter --
it commits "deleting" status into ownership table, deletes the objects
from server, then removes the entry from ownership table. So the atomic
deleter does the same and the .wipe() just removes the objects, because
it's not supposed to be atomic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
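The three-step protocol described above can be sketched like this. The `ownership_table` and `object_store` types are stand-ins, not real ScyllaDB code; the sketch only shows why committing the "deleting" status first makes the deletion recoverable after a crash.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model of the S3 ownership table.
struct ownership_table {
    std::set<std::string> deleting;  // entries marked with "deleting" status
    std::set<std::string> owned;     // live ownership entries
};

// Hypothetical model of the object server.
struct object_store {
    std::set<std::string> objects;
};

// Atomic deleter: commit the "deleting" status first, so that a crash
// between the steps can be recovered by simply re-running the deletion.
void atomic_delete(ownership_table& tbl, object_store& store,
                   const std::string& key) {
    tbl.deleting.insert(key);  // 1. commit "deleting" into the ownership table
    store.objects.erase(key);  // 2. remove the objects (the plain wipe step)
    tbl.owned.erase(key);      // 3. remove the entry from the ownership table
    tbl.deleting.erase(key);
}
```

The plain `.wipe()` corresponds to step 2 alone, which is why it needn't be atomic by itself.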
While the driver isn't known without the sstable itself, we have a
vector of them and can get it from the front element. This is not very
generic, but fortunately all sstables here belong to the same table and,
respectively, to the same storage and even prefix. The latter is also
assert-checked by the sstable_directory atomic deleter code.
For now S3 storage returns the same directory-based deleter, but next
patch will change that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to let manager decide which storage driver to call for atomic
sstables deletion in the next patch. While at it -- rename the
sstable_directory's method into something more descriptive (to make
compiler catch all callers of it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This field needs to call trace_state::ttl_by_type() which, in turn,
looks into _props. The latter should have been initialized already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It takes props from constructor args and tunes them according to the
constructing "flavor" -- primary or secondary state. Adding two static
helpers code-documents the intent and makes list-initialization possible
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction strategies know how to pick files that are most likely to
satisfy tombstone purge conditions (i.e. not shadow data in uncompacting
files).
This logic can be bypassed if tombstone GC was disabled by user,
as it's a waste of effort to proceed with it until re-enabled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.
Not wired into an API yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The instance ptr and props have to be set up early, because other
members' initialization depends on them. It's currently OK, because
other members are initialized in the constructor body, but moving them
into initializer list would require correct ordering
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now does nothing but wrap the make_lw_shared<one_session_records>()
call. Callers can do it on their own, thus facilitating further
list-initialization patching
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For that to happen the value evaluation is moved from the
init_session_records() into a private trace_state helper as it checks
the props values initialized earlier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This object is constructed via one_session_records thus the latter needs
to pass some arguments along
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
Use the recently introduced replica side query utility functions to
query the content of the system tables. This allows us to cut the
dependency of the system keyspace on storage proxy.
The methods still take a storage proxy parameter; this will be replaced
with replica::database in the next patch.
There is still one hidden storage proxy dependency left, via
cql3::query_processor. This will be addressed later.
Containing utility methods to query data from the local replica.
Intended to be used to read system tables, completely bypassing storage
proxy in the process.
This duplicates some code already found in storage proxy, but that is a
small price to pay, to be able to break some circular dependencies
involving storage proxy, that have been plaguing us since time
immemorial.
One thing we lose with this is the smp service level used in storage
proxy. If this becomes a problem, we can create one in database and use
it in these methods too.
Another thing we lose is increasing the `replica_cross_shard_ops` storage
proxy stat. I think this is not a problem at all, as these new functions
are meant to be used by internal users, which will reduce the internal
noise in this metric, which is meant to indicate users not using
shard-aware clients.
As a result of the preceding patches, permissions on a function
are now granted to its creator. As a result, some permissions may
appear which we did not expect before.
In the test_udf_permissions_serialization, we create a function
as the superuser, and as a result, when we compare the permissions
we specifically granted to the ones read from the LIST PERMISSIONS
result, we get more than expected - this is fixed by granting
permissions explicitly to a new user and only checking this user's
permissions list.
In the test_grant_revoke_udf_permissions case, we test whether
the DROP permission is enforced on a function that we have previously
created as the same user - as a result we have the DROP permission
even without granting it directly. We fix this by testing the DROP
permission on a function created by a different user.
In the test_grant_revoke_alter_udf_permissions case, we previously
tested that we require both ALTER and CREATE permissions when executing
a CREATE OR REPLACE FUNCTION statement. The new permissions required
for this statement now depend on whether we actually CREATE or REPLACE
a function, so now we test that the ALTER permission is required when
REPLACING a function, and the CREATE permission is required when
CREATING a function. After the changes, the case no longer needs to
be artificially extracted from the previous one, so they are merged
now. Analogous adjustments are made in the test case
test_grant_revoke_alter_uda_permissions.
Currently, when a user is altering a function, they need
both CREATE and ALTER permissions, instead of just ALTER.
Additionally, after altering a function, the user is
treated as an owner of this function, gaining all access
permissions to it.
This patch fixes these 2 issues, by checking only the ALTER
permission when actually altering, and by not modifying
user's permissions if the user did not actually create
the function.
When a user creates a function, they should have all permissions on
this function.
Similarly, when a user creates a keyspace, they should have all
permissions on functions in the keyspace.
This patch introduces GRANTs on the missing permissions.
In the following patch, the grant_permissions_to_creator method is going
to be also used to grant permissions on a newly created function. The
function resource may contain user-defined types which need the
query processor to be prepared, so we add a reference to it in advance
in this patch for easier review.
Despite the cql-pytests being intended to pass on both Scylla and
Cassandra, the test_permissions.py case was actually failing on
Cassandra in a few cases. The most common issue was a different
exception type returned by Scylla and Cassandra for an invalid
query. This was fixed by accepting 2 types of exceptions when
necessary.
The second issue was java UDF code that did not compile, which was
fixed simply by debugging the code.
The last issue was a case that was scylla_only with no good reason.
The missing java UDFs were added to that case, and the test was
adjusted so that the ALTER permission was checked in a
CREATE OR REPLACE statement only if the UDF already existed --
Scylla requires it in both cases, which will get resolved in the
next patch.
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.
this change uses generator to generate generations for
`make_sstable_for_all_shards()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
always avoid using generation_type if possible. this helps us to
hide the underlying type of generation identifier, which could also
be a UUID in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.
this change uses generator to create generations for
`make_sstable_for_this_shard()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a global-factory onto the client that is
- cross-shard copyable
- able to generate a client from the local storage_manager for a given endpoint
With that the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the s3 file handle tries to carry client's info via explicit
host name and an endpoint config pointer. This is buggy: the latter
pointer is shard-local and cannot be transferred across shards.
This patch prepares the fix by abstracting the client handle part.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the client is accessible directly via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.
The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage are immutable for now.
Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is the part of keyspace configuration and
updating one in the object_storage.yaml will have no effect on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays each sstable gets its own instance of an s3::client. This patch
keeps clients on storage_manager's endpoints map and when creating a
storage for an sstable -- grab the shared pointer from the map, thus
making one client serve all sstables over there (except for those that
duplicated their files with the help of foreign-info, but that's to be
handled by next patches).
Moving the ownership of a client to the storage_manager level also means
that the client has to be closed on manager's stop, not on sstable
destroy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the map is endpoint -> config_ptr. Wrap the config_ptr into an
s3_endpoint struct. Next patch will keep the client on this new wrapper
struct thus making them shared between sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the map<endpoint, config> sits on the sstables manager and its
update is governed by database (because it's peering and can kick other
shards to update it as well).
Having the sharded<storage_manager> at hand lets us free the database
from the need to update configs and keeps sstables_manager a bit smaller.
Also this will allow keeping s3 clients shared between sstables via this
map by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager in question keeps track of whatever sstables_manager needs
to work with the storage (spoiler: only the S3 one). It's a main-local
sharded peering service, so that the container() call can be used by next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It just needs to catch the system_error of ENOENT and re-throw it as
malformed_sstable_exception.
Indentation is deliberately left broken. Again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One non-trivial change is the removal of buf temporary variable. That's
because it existed under the same name in the .then() lambda, generating
a name conflict after coroutinization.
Other than that it's pretty straightforward.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that schema pull may issue a raft read barrier, it may get stuck if
the majority is not available. Make the operation abortable and abort it
during queries if the timeout is reached.
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
With gc_before set to gc_clock::time_point::max, a row could be dropped
by compaction even if its ttl has not expired yet.
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
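The fix described above can be sketched as follows. The type and function names here are illustrative, not ScyllaDB's real tombstone-GC code; the sketch only shows that immediate mode must behave like timeout mode with gc_grace_seconds == 0, returning the query time rather than time_point::max().

```cpp
#include <cassert>
#include <chrono>

// Illustrative stand-ins for the real gc_clock types.
using time_point = std::chrono::system_clock::time_point;

enum class gc_mode { timeout, immediate };

// gc_before: tombstones older than this point are eligible for purging.
time_point get_gc_before(gc_mode mode, time_point query_time,
                         std::chrono::seconds gc_grace_seconds) {
    switch (mode) {
    case gc_mode::timeout:
        // tombstones become purgeable gc_grace_seconds before the query time
        return query_time - gc_grace_seconds;
    case gc_mode::immediate:
        // equivalent to timeout mode with gc_grace_seconds == 0;
        // returning time_point::max() here was the bug -- it let compaction
        // drop rows whose TTL had not expired yet
        return query_time;
    }
    return time_point::min();  // unreachable
}
```

With this, a row written `USING TTL 1000000` survives an immediate-mode compaction, since its expiry lies after `query_time`.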
Fixes#13572
Closes#13800
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized by either special-casing this case
or, better, by skipping forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.
Fixes#13634
Also, add a respective rest_api unit test
Closes#13849
* github.com:scylladb/scylladb:
test: rest_api: test_storage_service: add test_storage_service_keyspace_cleanup_with_no_owned_ranges
compaction_manager: perform_cleanup: handle empty owned ranges
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This patch changes the code to issue a raft read
barrier before the pull, which will guarantee that the keyspace is created
before the actual schema pull is performed.
Refs: #3760
Fixes: #13211
Fixes https://github.com/scylladb/scylladb/issues/13805
This commit fixes the redirection required by moving the Glossary
page from the top of the page tree to the Reference section.
As the change was only merged to master (not to branch-5.2),
it is not working for version 5.2, which is now the latest stable
version.
For this reason, "stable" in the path must be replaced with "master".
Closes#13847
the series drops some of the callers using the SSTable generation as an integer. as the generation of an SSTable is but an identifier, we should not use it as an integer outside of generation_type's implementation.
Closes#13845
* github.com:scylladb/scylladb:
test: drop unused helper functions
test: sstable_mutation_test: avoid using helper using generation_type::int_t
test: sstable_move_test: avoid using helper using generation_type::int_t
test: sstable_*test: avoid using helper using generation_type::int_t
test: sstable_3_x_test: do not use reuseable_sst() accepting integer
Updates to the compaction_group sstable sets are
never done in place. Instead, the update is done
on a mutable copy of the sstable set, and the lw_shared
result is set back in the compaction_group.
(see for example compaction_group::set_main_sstables)
Therefore, there's currently a risk in perform_cleanup
`get_sstables` lambda that if it yields while in
set.for_each_sstable, the sstable_set might be replaced
and the copy it is traversing may be destroyed.
This was introduced in c2bf0e0b72.
To prevent that, hold on to set.shared_from_this()
around set.for_each_sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13852
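The hazard fixed above can be modeled in a few lines. The names here (`sstable_set`, `compaction_group`, `count_sstables`) are illustrative stand-ins, not the real ScyllaDB classes; the point is that because updates swap in a new shared set rather than mutating in place, a traversal must hold its own shared reference across any yield point.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// The set is never mutated in place: updates replace the shared pointer.
struct sstable_set {
    std::vector<int> sstables;
};

struct compaction_group {
    std::shared_ptr<sstable_set> main_set = std::make_shared<sstable_set>();
    void set_main_sstables(std::shared_ptr<sstable_set> s) {
        main_set = std::move(s);
    }
};

// Safe traversal: keep a local shared_ptr alive for the whole loop
// (the analogue of set.shared_from_this() around for_each_sstable), so a
// concurrent set_main_sstables() cannot destroy the copy being iterated.
std::size_t count_sstables(const compaction_group& cg) {
    auto keep_alive = cg.main_set;
    std::size_t n = 0;
    for (int sst : keep_alive->sstables) {
        (void)sst;
        // a real reader could yield here; keep_alive keeps the old set
        // valid even if cg.main_set is replaced meanwhile
        ++n;
    }
    return n;
}
```

Without `keep_alive`, replacing `main_set` mid-iteration would drop the last reference to the set being traversed, which is exactly the use-after-free risk introduced in c2bf0e0b72.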
all users of these two helpers have switched to their alternatives,
so there is no need to keep them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer. also, since
`generate_clustered()` is only used by functions in the same
compilation unit, let's take the opportunity to mark it `static`.
and there is no need to pass generation as a template parameter,
we just pass it as a regular parameter.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in helper functions, we just use
`generation_type`. please note, despite that we'd prefer generating
the generations using generator, the SSTables used by the tests
modified by this change are stored in the repo, to ensure that the
tests are always able to find the SSTable files, we keep them
unchanged instead of using generation_generator, or a random
generation for the testing.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type by offering a default parameter, which is a
generation created using 1. this preserves the existing behavior.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type.
also, as no callers are using the last parameter of `make_test_sstable()`,
let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This series fixes an issue with altering permissions on UDFs with
parameter types that are UDTs with quoted names and adds
a test for it.
The issue was caused by the format of the temporary string
that represented the UDT in `auth::resource`. After parsing the
user input to a raw type, we created a string representing the
UDT using `ut_name::to_string()`. The segment of the resulting
string that represented the name of the UDT was not quoted,
making us unable to parse it again when the UDT was being
`prepare`d. Other than for this purpose, the `ut_name::to_string()`
is used only for logging, so the solution was modifying it to
maybe quote the UDT name.
Ref: https://github.com/scylladb/scylladb/pull/12869
Closes#13257
* github.com:scylladb/scylladb:
cql-pytest: test permissions for UDTs with quoted names
cql: maybe quote user type name in ut_name::to_string()
cql: add a check for currently used stack in parser
cql-pytest: add an optional name parameter to new_type()
Currently, when creating a UDA, we only check for permissions
for creating functions. However, the creator gains all permissions
to the UDA, including the EXECUTE permission. This enables the
user to also execute the state/reduce/final functions that were
used in the UDA, even if they don't have the EXECUTE permissions
on them.
This patch adds checks for the missing EXECUTE permissions, so
that the UDA can be only created if the user has all required
permissions.
The new permissions that are now required when creating a UDA
are now granted in the existing UDA test.
Fixes#13818
Closes#13819
Currently, when a function has no arguments, the function_args()
method, which is supposed to return a vector of string_views
representing the arguments of the function, returns a nullopt
instead, as if it were a functions_resource on all functions
or all functions in a keyspace. As a result, the functions_resource
can't be properly formatted.
This is fixed in this patch by returning an empty vector instead,
and the fix is confirmed in a cql-pytest.
Fixes#13842
Closes#13844
data_consume_rows keeps an input_stream member that must be closed.
In particular, on the error path, when we destroy it possibly
with readaheads in flight.
Fixes#13836
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13840
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized by either special-casing this case
or, better, by skipping forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.
Fixes#13634
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.
This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance
To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the source of the values
are views into a bytes_ostream. However, luckily bytes_ostream and managed_bytes_view
are mostly compatible so with a little effort this can be done.
The series is neutral wrt performance:
before:
```
222118.61 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
224250.14 tps ( 61.1 allocs/op, 12.1 tasks/op, 43094 insns/op, 0 errors)
224115.66 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
223508.70 tps ( 61.1 allocs/op, 12.1 tasks/op, 43107 insns/op, 0 errors)
223498.04 tps ( 61.1 allocs/op, 12.1 tasks/op, 43087 insns/op, 0 errors)
```
after:
```
220708.37 tps ( 61.1 allocs/op, 12.1 tasks/op, 43118 insns/op, 0 errors)
225168.99 tps ( 61.1 allocs/op, 12.1 tasks/op, 43081 insns/op, 0 errors)
222406.00 tps ( 61.1 allocs/op, 12.1 tasks/op, 43088 insns/op, 0 errors)
224608.27 tps ( 61.1 allocs/op, 12.1 tasks/op, 43102 insns/op, 0 errors)
225458.32 tps ( 61.1 allocs/op, 12.1 tasks/op, 43098 insns/op, 0 errors)
```
Though I expect with some more effort we can eliminate some copies.
Closes#13637
* github.com:scylladb/scylladb:
cql3: untyped_result_set: switch to managed_bytes_view as the cell type
cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
cql3: untyped_result_set: always own data
types: abstract_type: add mixed-type versions of compare() and equal()
utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
utils: managed_bytes: add managed_bytes_view::with_linearized()
utils: managed_bytes: mark managed_bytes_view::is_linearized() const
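The motivation for problem 1 above (large contiguous allocations from bytes_opt) can be sketched with a toy fragmented value type. This models the idea behind managed_bytes only; the real chunk size, layout, and API differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy fragmented byte buffer: a large value is split into bounded chunks,
// so no single allocation grows with the blob size (unlike a contiguous
// bytes_opt, whose one allocation is as large as the whole blob).
struct fragmented_bytes {
    static constexpr std::size_t fragment_size = 128 * 1024;  // illustrative cap
    std::vector<std::string> fragments;

    explicit fragmented_bytes(const std::string& value) {
        for (std::size_t off = 0; off < value.size(); off += fragment_size) {
            // each allocation stays bounded, however large the blob is
            fragments.push_back(value.substr(off, fragment_size));
        }
    }

    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& f : fragments) {
            n += f.size();
        }
        return n;
    }
};
```

Bounded allocations are friendlier to the allocator under fragmentation and avoid the latency spikes large contiguous blobs cause, which is why harmonizing selectors on managed_bytes_opt matters.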
Currently, when a user has permissions on a function/all functions in
keyspace, and the function/keyspace is dropped, the user keeps the
permissions. As a result, when a new function/keyspace is created
with the same name (and signature), they will be able to use it even
if no permissions on it are granted to them.
Similarly to regular UDFs, the same applies to UDAs.
After this patch, the corresponding permissions on functions are dropped
when a function/keyspace is dropped.
Fixes#13820
Closes#13823
Task manager's tasks covering scrub compaction at the top,
shard and table levels.
For these levels we have common scrub tasks for each scrub
mode since they share code. Scrub modes will be differentiated
on compaction group level.
Closes#13694
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test scrub compaction
compaction: add table_scrub_sstables_compaction_task_impl
compaction: add shard_scrub_sstables_compaction_task_impl
compaction: add scrub_sstables_compaction_task_impl
api: get rid of unnecessary std::optional in scrub
compaction: rename rewrite_sstables_compaction_task_impl
in this series, `data_dictionary::storage_options` is refactored so that each dedicated storage option takes care of itself, instead of putting all the logic into `storage_options`. cleaner this way. as the next step, i will add yet another set of options for the tiered_storage, which is backed by the s3_storage and the local filesystem_storage. with this change, we will be able to group the per-option functionalities together by the option they are designed for, instead of sharding them by the actual function.
Closes#13826
* github.com:scylladb/scylladb:
data_dictionary: define helpers in options
data_dictionary: only define operator== for storage options
The validator classes have their definition in a header located in mutation/, while their implementation is located in a .cc in readers/mutation_reader.cc.
This PR fixes this inconsistency by moving the implementation into mutation/mutation_fragment_stream_validator.cc. The only change is that the validator code gets a new logger instance (but the logger variable itself is left unchanged for now).
Closes#13831
* github.com:scylladb/scylladb:
mutation/mutation_fragment_stream_validator.cc: rename logger
readers,mutation: move mutation_fragment_stream_validator to mutation/
When new nodes are added or existing nodes are deleted, the topology
state machine needs to shunt reads from the old nodes to the new ones.
This happens in the `write_both_read_new` state. The problem is that
previously this state was not handled in any way in `token_metadata` and
the read nodes were only changed when the topology state machine reached
the final 'owned' state.
To handle `write_both_read_new` an additional `interval_map` inside
`token_metadata` is maintained similar to `pending_endpoints`. It maps
the ranges affected by the ongoing topology change operation to replicas
which should be used for reading. When topology state sm reaches the
point when it needs to switch reads to a new topology, it passes
`request_read_new=true` in a call to `update_pending_ranges`. This
forces `update_pending_ranges` to compute the ranges based on new
topology and store them to the `interval_map`. On the data plane, when a
read on coordinator needs to decide which endpoints to use, it first
consults this `interval_map` in `token_metadata`, and only if it doesn't
contain a range for current token it uses normal endpoints from
`effective_replication_map`.
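The lookup order described above can be sketched as follows. This is an illustrative model only: the real code keeps an `interval_map` inside `token_metadata`, not this linear-scan structure, and the names here are hypothetical. The essential behavior is that the transition ranges are consulted first, falling back to the normal replicas when no range matches.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using endpoints = std::vector<std::string>;

// Models a (start, end] token range, as token rings conventionally use.
struct token_range {
    int64_t start;
    int64_t end;
};

struct read_endpoint_resolver {
    // ranges affected by the ongoing topology change -> replicas to read from
    std::vector<std::pair<token_range, endpoints>> read_new_ranges;
    // normal replicas from the effective replication map
    endpoints normal_replicas;

    endpoints endpoints_for_reading(int64_t token) const {
        // consult the transition ranges first...
        for (const auto& [range, replicas] : read_new_ranges) {
            if (token > range.start && token <= range.end) {
                return replicas;
            }
        }
        // ...and fall back to the effective replication map otherwise
        return normal_replicas;
    }
};
```

During `write_both_read_new`, only tokens inside the affected ranges are redirected to the new replicas; everything else keeps reading from the normal endpoints.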
Closes#13376
* github.com:scylladb/scylladb:
storage_proxy, storage_service: use new read endpoints
storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
token_metadata: add unit test for endpoints_for_reading
token_metadata: add endpoints for reading
sequenced_set: add extract_set method
token_metadata_impl: extract maybe_migration_endpoints helper function
token_metadata_impl: introduce migration_info
token_metadata_impl: refactor update_pending_ranges
token_metadata: add unit tests
token_metadata: fix indentation
token_metadata_impl: return unique_ptr from clone functions
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change restores the use of raft::internal::tagged_uint64
for the raft types and adds back an idl definition for
it that is not marked as final, similar to the way
raft::internal::tagged_id extends utils::tagged_uuid.
Fixes#13752
Closes#13774
* github.com:scylladb/scylladb:
raft, idl: restore internal::tagged_uint64 type
raft: define term_t as a tagged uint64_t
idl: gossip_digest: include required headers
in this series, we encode the value of the generation using a UUID to prepare for the UUID generation identifier. simpler this way, as we don't need to have two ways to encode an integer or a timeuuid: a uuid with a zero timestamp, and a variant. also, add a `from_string()` factory method to convert a string to a generation to hide the underlying type of value from generation_type's users.
Closes#13782
* github.com:scylladb/scylladb:
sstable: use generation_type::from_string() to convert from string
sstable: encode int using UUID in generation_type
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the driver receives some information about the prepared statement, including names of values bound to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`) there's an expectation that the name assigned to this bind marker will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been unified with the code that handles generic function calls, and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being `"partition key token"`, so a change to `token(p1, p2)` broke some things.
This patch sets the name back to `"partition key token"`. To achieve this we detect any restrictions that match the pattern `token(p1, p2, p3) = X` and set the receiver name for X to `"partition key token"`.
Fixes: #13769
Closes#13815
* github.com:scylladb/scylladb:
cql-pytest: test that bind marker is partition key token
cql3/prepare_expr: force token() receiver name to be partition key token
in this change,
* instead of using "\d+" to match the generation, use "[^-]",
* let generation_type convert a string to a generation
before this change, we cast the matched string in the SSTable file name
to an integer and then construct a generation identifier from the integer.
this solution has a strong assumption that the generation is represented
with an integer; we should not encode this assumption in sstable.cc,
instead we'd better let generation_type itself take care of this. also,
to relax the restriction of the regex for matching the generation, let's
just use any characters except for the delimiter -- "-".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we already use a UUID for encoding a bigint in the SSTable registry
table, let's just use the same approach for encoding a bigint in
generation_type, to be more consistent, with less repetition this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
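The encoding idea can be sketched like this. The field layout below is illustrative, not ScyllaDB's exact scheme: an integer generation rides in a UUID whose timestamp half is zero, so a single representation covers both the legacy integer identifiers and future timeuuid ones.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative 128-bit UUID split into two 64-bit halves.
struct uuid_t {
    uint64_t msb = 0;  // timestamp bits; zero marks "this is really an int"
    uint64_t lsb = 0;
};

// Carry an integer generation in a UUID with a zero timestamp half.
uuid_t encode_int_generation(int64_t gen) {
    return uuid_t{0, static_cast<uint64_t>(gen)};
}

bool is_int_generation(const uuid_t& u) {
    // a real timeuuid always has nonzero timestamp bits, so zero is
    // unambiguous as the integer marker
    return u.msb == 0;
}

int64_t decode_int_generation(const uuid_t& u) {
    return static_cast<int64_t>(u.lsb);
}
```

A `from_string()`-style factory can then pick the right decoding based on whether the parsed value looks like a plain integer or a full UUID.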
We use set_topology_transition_state to set read_new state
in storage_service::topology_state_load
based on _topology_state_machine._topology.tstate.
This triggers update_pending_ranges to compute and store new ranges
for read requests. We use this information in
storage_proxy::get_endpoints_for_reading
when we need to decide which nodes to use for reading.
Since we are going to use remapped_endpoints_for_reading, we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need - it's used in all read code paths.
From its name, however, this was not obvious.
Also, we add the parameter ks_name as we'll need it
to pass to remapped_endpoints_for_reading.
In this patch we add
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new, update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.
Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
instead of dispatching and implementing the per-option handling
right in `storage_option`, define these helpers in the dedicated
option classes themselves, so `storage_option` is only responsible for
dispatching.
much cleaner this way. this change also makes it easier to add yet
another storage backend.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as the only user of these comparison operators is
`storage_options::can_update_to()`, which just checks whether the given
`storage_options` is equal to the stored one, there is no need to define
the <=> operator.
also, no need to add the `friend` specifier, as the options are a plain
struct and all the member variables are public.
make the comparison operator a member function instead of a free
function, as in C++20 comparison operators are symmetric.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The validator classes have their definition in a header located in mutation/,
while their implementation is located in a .cc in readers/mutation_reader.cc.
This patch fixes this inconsistency by moving the implementation into
mutation/mutation_fragment_stream_validator.cc. The only change is that
the validator code gets a new logger instance (but the logger variable itself
is left unchanged for now).
this change extracts the storage class and its derived classes
out into their own source files, for a couple of reasons:
- better readability. sstables.hh is over 1005 lines,
and sstables.cc 3602 lines. it's a little difficult to figure
out how the different parts of these sources interact with each
other. for instance, with this change, it's clear that some of the helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Closes #13785
* github.com:scylladb/scylladb:
sstables: storage: coroutinize idempotent_link_file()
sstables: extract storage out
When preparing a query each bind marker gets a name.
For a query like:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
The bind marker's name should be `"partition key token"`.
The Java driver relies on this name; having something else,
like `"token(p1, p2)"`, as the name breaks the Java driver.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the driver receives some information
about the prepared statement, including the names of values bound
to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`)
there's an expectation that the name assigned to this bind marker
will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been
unified with the code that handles generic function calls,
and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being
`"partition key token"`, so a change to `token(p1, p2)`
broke some things.
This patch sets the name back to `"partition key token"`.
To achieve this we detect any restrictions that match
the pattern `token(p1, p2, p3) = X` and set the receiver
name for X to `"partition key token"`.
Fixes: #13769
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map.
Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
Now update_pending_ranges is quite complex, mainly
because it tries to act efficiently and update only
the affected intervals. However, it uses the function
abstract_replication_strategy::get_ranges, which calls
calculate_natural_endpoints for every token
in the ring anyway.
Our goal is to start reading from the new replicas for
ranges in write_both_read_new state. In the current
code structure this is quite difficult to do, so
in this commit we first simplify update_pending_ranges.
The main idea of the refactoring is to build a new version
of token_metadata based on all planned changes
(join, bootstrap, replace) and then for each token
range compare the result of calculate_natural_endpoints on
the old token_metadata and on the new one.
Those endpoints that are in the new version and
are not in the old version should be added to the pending_ranges.
The add_mapping function is extracted for the
future - we are going to use it to handle read mappings.
Special care is taken when replacing with the same IP.
The coordinator employs the
get_natural_endpoints_without_node_being_replaced function,
which excludes such endpoints from its result. If we compare
the new (merged) and current token_metadata configurations, such
endpoints will also be absent from pending_endpoints since
they exist in both. To address this, we copy the current
token_metadata and remove these endpoints prior to comparison.
This ensures that nodes being replaced are treated
like those being deleted.
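The rule described above can be modeled with plain sets (node names here are invented for illustration; the real code compares calculate_natural_endpoints results per token range):

```python
# Pending replicas for a range: those present in the planned (merged)
# topology's natural endpoints but absent from the current ones.
def pending_endpoints(current_replicas: set, planned_replicas: set) -> set:
    return planned_replicas - current_replicas

current = {"n1", "n2", "n3"}
planned = {"n1", "n2", "n4"}                       # n4 bootstrapping, n3 leaving
assert pending_endpoints(current, planned) == {"n4"}

# Replacing with the same IP: n3 appears in both versions, so a naive
# comparison misses it; removing it from the copy of the current topology
# first makes it show up as pending, just like a deleted node.
same_ip = {"n1", "n2", "n3"}
assert pending_endpoints(same_ip, same_ip) == set()
assert pending_endpoints(same_ip - {"n3"}, same_ip) == {"n3"}
```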
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8
Fixes #13752
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All Raft verbs include `dst_id`, the ID of the destination server, but
it isn't checked. `append_entries` will work even if it arrives at
completely the wrong server (but in the same group). It can cause
problems, e.g. in the scenario of replacing a dead node.
This commit adds a check that `dst_id` matches the server's ID; if it
doesn't, the Raft verb is rejected.
Closes #12179
Testing
---
Testcase and scylla's configuration:
57d3ef14d8
It artificially lengthens the duration of replacing the old node. This
increases the chance of the new node sending an RPC command to the
replaced node.
In the logs of the node that replaced the old one, we can see logs in
the form:
```
DEBUG <time> [shard 0] raft_group_registry - Got message for server <dst_id>, but my id is <my_id>
```
It indicates that the Raft verb with the wrong `dst_id` was rejected.
This test isn't included in the PR because it doesn't catch any specific error.
Closes #13575
* github.com:scylladb/scylladb:
service/raft: raft_group_registry: Add verification of destination ID
service/raft: raft_group_registry: `handle_raft_rpc` refactor
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh, for a couple of reasons:
- better readability. sstables.hh is over 1005 lines,
and sstables.cc 3602 lines. it's a little difficult to figure
out how the different parts of these sources interact with each
other. for instance, with this change, it's clear that some of the helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Rename rewrite_sstables_compaction_task_impl to
sstables_compaction_task_impl, as the new name describes this class
of tasks better. Rewriting sstables is a slightly more fine-grained
type of sstable compaction task than the one needed here.
this series prepares for the UUID based generation by replacing the general `value()` function with a more specifically named one: `as_int()`.
Closes #13796
* github.com:scylladb/scylladb:
test: drop a reusable_sst() variant which accepts int as generation
treewide: replace generation_type::value() with generation_type::as_int()
In addition to the data file itself. Currently validation avoids the
index altogether, using the crawling reader which only relies on the
data file and ignores the index+summary. This is because a corrupt
sstable usually has a corrupt index too and using both at the same time
might hide the corruption. This patch adds targeted validation of the
index, independent of and in addition to the already existing data
validation: it validates the order of index entries as well as whether
the entry points to a complete partition in the data file.
This will usually result in duplicate errors for out-of-order
partitions: one for the data file and one for the index file.
Fixes: #9611
Closes #11405
* github.com:scylladb/scylladb:
test/cql-pytest: add test_sstable_validation.py
test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
tools/scylla-sstables: write validation result to stdout
sstables/sstable: validate(): delegate to mx validator for mx sstables
sstables/mx/reader: add mx specific validator
mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter
sstables/mx/reader: template data_consume_rows_context_m on the consumer
sstables/mx/reader: move row_processing_result to namespace scope
sstables/mx/reader: use data_consumer::proceed directly
sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic)
compaction/compaction: remove now unused scrub_validate_mode_validate_reader()
compaction/compaction: move away from scrub_validate_mode_validate_reader()
tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
sstables/sstable: add validate() method
compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
compaction: scrub: use error messages from validator
mutation_fragment_stream_validator: produce error messages in low-level validator
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loop relies on admission checks, triggered by the read
releasing resources, to bring any queued read into the _ready_list
while it is executing the current read. But in some cases the current
read might not free any resources and thus fails to trigger an admission
check, so the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
Fixes: #13540
Closes #13541
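A toy Python model of the loop (class and method names are invented for illustration, not the real C++ semaphore) shows why the explicit check matters:

```python
from collections import deque

# Toy model: waiters move to the ready list only when an admission check runs.
# We assume a finishing read frees its slot WITHOUT triggering a check,
# mimicking a read that releases no resources mid-flight.
class Semaphore:
    def __init__(self, slots):
        self.slots = slots
        self.ready = deque()
        self.waiters = deque()
        self.executed = []

    def admission_check(self):
        while self.waiters and self.slots > 0:
            self.slots -= 1
            self.ready.append(self.waiters.popleft())

    def enqueue(self, read):
        self.waiters.append(read)
        if not self.ready:            # new permits queue while ready is non-empty
            self.admission_check()

    def run_loop(self):
        while True:
            while self.ready:
                read = self.ready.popleft()
                self.executed.append(read)
                self.slots += 1       # slot freed, but no check fires here
            # the fix: explicit admission check once the ready list is drained,
            # so admissible waiters don't sit queued until some other event
            self.admission_check()
            if not self.ready:
                return

sem = Semaphore(slots=1)
sem.enqueue("r1")                     # admitted immediately
sem.enqueue("r2")                     # queued: ready list was non-empty
sem.run_loop()
assert sem.executed == ["r1", "r2"]   # without the final check, r2 would starve
```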
The discussion on the thread says that when we reformat a volume with another
filesystem, the kernel and libblkid may skip populating /dev/disk/by-* since
they detected two filesystem signatures, because mkfs.xxx did not clear the
previous filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice: for the target disks and also for the RAID device.
wipefs on the RAID device is needed since wipefs on the disks doesn't clear filesystem signatures on /dev/mdX (we may see a previous filesystem signature on /dev/mdX when we construct the RAID volume multiple times on the same disks).
Also drop the -f option from mkfs.xfs; this verifies that wipefs worked as
expected.
Fixes #13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #13738
The test performs an `INSERT` followed by a `SELECT`, checking if the
previously inserted data is returned.
This may fail because we're using `ring_delay = 0` in tests and the two
queries may arrive at different nodes, whose `token_metadata` didn't
converge yet (it's eventually consistent based on gossiping).
I illustrated this here:
https://github.com/scylladb/scylladb/issues/12937#issuecomment-1536147455
Ensure that the nodes' token rings are synchronized (by waiting until
the token ring members on each node is the same as group 0
configuration).
Fixes #12937
Closes #13791
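The wait described above amounts to a polling loop; a minimal sketch (names and signature are illustrative, not the real test-library API) could look like:

```python
import time

# Hypothetical test helper: poll every node's view of the token ring until it
# matches the group 0 configuration, so a subsequent SELECT is routed to nodes
# whose token_metadata has converged.
def wait_for_token_ring_sync(nodes, group0_members, fetch_ring,
                             timeout=30.0, poll=0.1):
    deadline = time.monotonic() + timeout
    while True:
        views = [set(fetch_ring(n)) for n in nodes]
        if all(v == set(group0_members) for v in views):
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"token rings did not converge: {views}")
        time.sleep(poll)
```

Usage with a simulated cluster whose gossip eventually catches up:

```python
view = {"n1": {"a"}, "n2": {"a", "b"}}
def fetch(node):
    view["n1"] = {"a", "b"}   # gossip catches up
    return view[node]
wait_for_token_ring_sync(["n1", "n2"], {"a", "b"}, fetch, timeout=2)
```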
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.
However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.
To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.
Fixes #13788
Closes #13789
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the source codes, and build Wasm when
necessary. This series adds build commands that
compile C/Rust sources to Wasm and uses them for Wasm
programs that we're already using.
After these changes, adding a new program that should be
compiled to Wasm requires only adding its source code
and updating the `wasms` and `wasm_deps` lists in
`configure.py`.
All Wasm programs are built by default when building all
artifacts, artifacts in a given mode, or when building
tests. Additionally, a {mode}-wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/{mode}/wasm,
and are accessed in cql-pytests similarly to the way we're
accessing the scylla binary - using glob.
Closes #13209
* github.com:scylladb/scylladb:
wasm: replace wasm programs with their source programs
build: prepare rules for compiling wasm files
build: set the type of build_artifacts
test: extend capabilities of Wasm reading helper function
token_metadata takes token_metadata_impl as unique_ptr,
so it makes sense to create it that way in the first place
to avoid unnecessary moves.
token_metadata_impl constructor with shallow_copy parameter
was made public for std::make_unique. The effective
accessibility of this constructor hasn't changed though since
shallow_copy remains private.
After recent changes, we are able to store only the
C/Rust source codes for Wasm programs, and only build
them when necessary. This patch utilizes this
opportunity by removing most of the currently stored
raw Wasm programs, replacing them with C/Rust sources
and adding them to the new build system.
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the (C/Rust) source codes, and build Wasm
when necessary. This patch adds build commands that
compile C/Rust sources to Wasm.
After these changes, adding a new program that should be
compiled to Wasm requires only adding its source code
and updating the wasms and wasm_deps lists in
configure.py.
All Wasm programs are built by default when building all
artifacts, all artifacts in a given mode, or when building
tests. Additionally, a ninja wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/wasm.
Currently, build_artifacts are of type set[str] | list, which prevents
us from performing set operations on it. In a future patch, we will
want to take a set difference and set intersections with it, so we
initialize the type of build_artifacts to a set in all cases.
Currently, we require that the Wasm file is named the same
as the function. In the future we may want multiple functions
with the same name, which we can't currently do due to this
limitation.
This patch allows specifying the function name, so that multiple
files can have a function with the same name.
Additionally, the helper method now escapes "'" characters, so
that they can appear in future Wasm files.
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Therefore, let's implement it.
Fixes https://github.com/scylladb/scylladb/issues/13649.
Closes #13713
* github.com:scylladb/scylladb:
s3: Provide timestamps in the s3 file implementation
s3: Introduce get_object_stats()
s3: introduce get_object_header()
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Fixes #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get_object_stats() will be used for retrieving the content size and
also the last-modified time.
The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since #13452, we switched most of the caller sites from std::regex
to boost::regex. in this change, all occurrences of `#include <regex>`
are dropped unless std::regex is used in the same source file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13765
functionality-wise, `uint64_t_tri_compare()` is identical to the
three-way comparison operator, so no need to keep it. in this change,
it is dropped in favor of <=>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13794
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.
Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.
The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
untyped_result_set is used for internal queries, where ease-of-use is more
important than performance. Currently, cells are held either by value or
by reference (managed_bytes_view). An upcoming change will cause the
result set to be built from managed_bytes_view, making it non-owning, but
the source data is not actually held, resulting in a use-after-free.
Rather than chase the source and force the data to be owned in this case,
just drop the possibility for a non-owning untyped_result_set. It's only
used in non-performance-critical paths and safety is more important than
saving a few cycles.
This also results in simplification: previously, we had a variant selecting
monostate (for NULL), managed_bytes_view (for a reference), and bytes (for
owning data); now we only have a bytes_opt since that already signifies
data-or-NULL.
Once result_set transitions to managed_bytes_opt, untyped_result_set
will follow. For now it's easier to use bytes_opt.
compare() and equal() can compare two unfragmented values or two
fragmented values, but a mix of a fragmented value and an unfragmented
value runs afoul of C++ conversion rules. Add more overloads to
make it simpler for users.
The codebase evolved to have several different ways to hold a fragmented
buffer: fragmented_temporary_buffer (for data received from the network;
not relevant for this discussion); bytes_ostream (for fragmented data that
is built incrementally; also used for a serialized result_set), and
managed_bytes (used for lsa and serialized individual values in
expression evaluation).
One problem with this state of affairs is that using data in one
fragmented form with functions that accept another fragmented form
requires either a copy, or templating everything. The former is
unpalatable for fast-path code, and the latter is undesirable for
compile time and run-time code footprint. So we'd like to make
the various forms compatible.
In 53e0dc7530 ("bytes_ostream: base on managed_bytes") we changed
bytes_ostream to have the same underlying data structure as
managed_bytes, so all that remains is to add the right API. This
is somewhat difficult as the data is hidden in multiple layers:
ser::buffer_view<> is used to abstract a slice of bytes_ostream,
and this is further abstracted by using iterators into bytes_ostream
rather than directly using the internals. Likewise, it's impossible
to construct a managed_bytes_view from the internals.
Hack through all of these by adding extract_implementation() methods,
and a build_managed_bytes_view_from_internals() helper. These are all
used by new APIs buffer_view_to_managed_bytes_view() that extract
the internals and put them back together again.
Ideally we wouldn't need any of this, but unifying the type system
in this area is quite an undertaking, so we need some shortcuts.
Becomes useful in later patches.
To avoid double-compiling the call to func(), use an
immediately-invoked lambda to calculate the bytes_view we'll be
calling func() with.
The code was incorrectly passing a data_value of type bytes due to
implicit conversion of the result of serialize() (bytes_opt) to a
data_value object of type bytes_type via:
data_value(std::optional<NativeType>);
mutation::set_static_cell() accepts a data_value object, which is then
serialized using column's type in abstract_type::decompose(data_value&):
bytes b(bytes::initialized_later(), serialized_size(*this, value._value));
auto i = b.begin();
value.serialize(i);
Notice that serialized_size() is taken from the column type, but
serialization is done using data_value's type. The two types may have
a compatible CQL binary representation, but may differ in native
types. serialized_size() may incorrectly interpret the native type and
come up with the wrong size. If the size is too small, we end up with
stack or heap corruption later after serialize().
For example, if the column type is utf8 but value holds bytes, the
size will be wrong because even though both use the basic_sstring
type, they have a different layout due to max_size (15 vs 31).
Fixes #13717
Closes #13787
When requesting memory via `reader_permit::request_memory()`, the
requested amount is added to `_requested_memory` member of the permit
impl. This is because multiple concurrent requests may be blocked and
waiting at the same time. When the requests are fulfilled, the entire
amount is consumed and individual requests track their requested amount
with `resource_units` to release later.
There is a corner-case related to this: if a reader permit is registered
as inactive while it is waiting for memory, its active requests are
killed with `std::bad_alloc`, but the `_requested_memory` field is not
cleared. If the read survives because the killed requests were part of
a non-vital background read-ahead, a later memory request will also
include the amount from the failed requests. This extra amount will not
be released and hence will cause a resource leak when the permit is
destroyed.
Fix by detecting this corner case and clearing the `_requested_memory`
field. Modify the existing unit test for the scenario of a permit
waiting on memory being registered as inactive, to also cover this
corner case, reproducing the bug.
Fixes: #13539
Closes #13679
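The accounting mismatch can be modeled in a few lines of Python (names here are illustrative, not the real reader_permit API):

```python
# Pending requests accumulate in _requested_memory; on fulfillment the whole
# accumulated amount is consumed, while each surviving request tracks only
# its own share (its resource_units) for later release.
class Permit:
    def __init__(self):
        self.consumed = 0          # memory accounted against the semaphore
        self.pending = []          # amounts of requests still waiting
        self._requested_memory = 0

    def request_memory(self, n):
        self.pending.append(n)
        self._requested_memory += n

    def mark_inactive(self, clear_counter):
        # waiting requests are killed (bad_alloc) when the permit is
        # registered as inactive
        self.pending.clear()
        if clear_counter:
            self._requested_memory = 0   # the fix

    def fulfill(self):
        self.consumed += self._requested_memory
        held = sum(self.pending)         # units of the live requests
        self._requested_memory = 0
        self.pending.clear()
        return held

def leaked(clear_counter):
    p = Permit()
    p.request_memory(100)                # non-vital read-ahead request
    p.mark_inactive(clear_counter)       # request killed, read survives
    p.request_memory(10)                 # later request from the same read
    released = p.fulfill()
    return p.consumed - released         # non-zero leaks on permit destruction

assert leaked(clear_counter=False) == 100   # bug: killed amount never released
assert leaked(clear_counter=True) == 0      # fix: counter cleared on kill
```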
this is one of the changes to reduce the use of integer-based generations
in tests. in future, we will need to expand the tests to exercise the UUID
based generation, or at least to be neutral to the underlying generation's
identifier type. so, removing the helpers which only accept `generation_type::int_t`
helps us make this happen.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()
because we will switch over to a UUID based generation identifier, the member
function and the free function generation_value() cannot fulfill the needs
anymore. so, in this change, they are consolidated and replaced by
"as_int()", whose name is more specific and won't be misleading even after
switching to the UUID based generation identifier, as `value()` would be
confusing by then: it could be an integer or a UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For some reason Scylla crashes on `aarch64` in release mode when calling
`fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this
line:
```
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id()));
```
I found this in our configure.py:
```
def get_clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
# we see miscompiles with 1200 and above with format("{}", uuid)
# also coroutine miscompiles with 600
return 300
else:
return 2500
```
but reducing it to `0` didn't help.
I managed to get the following backtrace (with inline threshold 0):
```
void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751
(inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728
(inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255
void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const*, wchar_t const*) at ??:?
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:?
fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:?
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206
(inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572
```
Maybe it's a bug in `fmt` library?
In any case replacing the call with `::format` (i.e. `seastar::format`
from seastar/core/print.hh) helps.
Do it for the entire file for consistency (and avoiding this bug).
Also, for the future, replace `format` calls with `::format` - now it's
the same thing, but the latter won't clash with `std::format` once we
switch to libstdc++13.
Fixes #13707
Closes #13711
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.
Closes #13427
* github.com:scylladb/scylladb:
pylib_test: add tests for read_last_line
pytest: add pylib_test directory
scylla_cluster.py: fix read_last_line
scylla_cluster.py: move read_last_line to util.py
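The fixed behavior can be sketched in Python as follows (a minimal assumed implementation, not the actual util.py code; block size and the 256-character cap are parameters here):

```python
import os

# Read the last line of a file by scanning blocks backwards from the end,
# so a "\n" split across block boundaries joins up when blocks are prepended,
# decode with errors="replace" in case the scan cuts into a multi-byte UTF-8
# sequence, and cap the result at max_len characters.
def read_last_line(path, block=512, max_len=256):
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # stop once we have enough newlines to bound the last line,
        # or enough bytes to satisfy max_len, or hit the file start
        while pos > 0 and data.count(b"\n") < 2 and len(data) <= 4 * max_len:
            step = min(block, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
        lines = data.split(b"\n")
        while lines and lines[-1] == b"":   # drop piece after a final newline
            lines.pop()
        last = lines[-1] if lines else b""
        return last.decode("utf-8", errors="replace")[-max_len:]
```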
to avoid the FTBFS after we bump up the Seastar submodule which bumped up its API level to v7. and API v7 is a breaking change. so, in order to unbreak the build, we have to hardwire the API level to 6. `configure.py` also does this.
Closes #13780
* github.com:scylladb/scylladb:
build: cmake: disable deprecated warning
build: cmake: use Seastar API level 6
we introduced the linkage to Boost::unit_test_framework in
fe70333c19, this library is used by
test/lib/test_utils.cc, so update CMake accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13781
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 512 chars;
* the logic of the function made simpler and more
maintainable.
There are two of them currently, with slightly different declarations. Better to leave only one.
Closes #13772
* github.com:scylladb/scylladb:
test: Deduplicate test::filename() static overload
test: Make test::filename return fs::path
The method in question suffers from scylladb/seastar#1298. The PR fixes it and makes it a bit shorter along the way.
Closes #13776
* github.com:scylladb/scylladb:
sstable: Close file at the end
sstables: Use read_entire_stream_cont() helper
This commit changes the configuration in the conf.py
file to make branch-5.2 the latest version and
remove it from the list of unstable versions.
As a result, the docs for version 5.2 will become
the default for users accessing the ScyllaDB Open Source
documentation.
This commit should be merged as soon as version 5.2
is released.
Closes #13681
One is unused, and the other one is not really required to be public.
Closes #13771
* github.com:scylladb/scylladb:
file_writer: Remove static make() helper
sstable: Use toc_filename() to print TOC file path
The test case consists of two internal sub-test-cases. Making them
explicit kills three birds with one stone
- improves parallelizm
- removes env's tempdir wiping
- fixes code indentation
refs: #12707
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13768
since Seastar now deprecates a bunch of APIs which accept io_priority_class,
we started to have deprecated warnings. before migrating to V7 API,
let's disable this warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to avoid the FTBFS after we bump up the Seastar submodule
which bumped up its API level to v7. and API v7 is a breaking
change. so, in order to unbreak the build, we have to hardwire
the API level to 6. `configure.py` also does this.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 02d5a0d7c...f94b1bb9c (12):
> Merge 'Unify CPU scheduling groups and IO priority classes' from Pavel Emelyanov
> scripts: addr2line: relax regular expression for matching kernel traces
> add dirs for clangd to .gitignore
> http::client: Log failed requests' body
> build: always quote the ENVIRONMENT with quotes
> exception_hacks: Change guard check order to work around static init fail
> shared_future: remove support for variadic futures
> iotune: Don't close file that wasn't opened
Fixes #13439
> Merge 'Relax per tick IO grab threshold' from Pavel Emelyanov
> future: simplify constraint on then() a little
> Merge 'coroutine: generator: initialize const member variable and enable generator tests' from Kefu Chai
> future: drop libc++ std::tuple compatibility hack
Closes #13777
The thing is that when closing a file input stream the underlying file is
not .close()-d (see scylladb/seastar#1298). The remove_by_toc_name() is
buggy in this sense. Using with_closeable() fixes it and makes the code
shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The remove_by_toc_name() wants to read the whole stream into a sstring.
There's a convenience helper to facilitate that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date.
Closes#13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
e2c9cdb576 moved the validation of the range tombstone change to the place where it is actually consumed, so we don't attempt to pass purged or discarded range tombstones to the validator. In doing so however, the validate pass was moved after the consume call, which moves the range tombstone change, so the validator was passed a moved-from range tombstone. Fix this by moving the validation to before the consume call.
Refs: #12575
Closes #13749
* github.com:scylladb/scylladb:
test/boost/mutation_test: add sanity test for mutation compaction validator
mutation/mutation_compactor: add validation level to compaction state query constructor
mutation/mutation_compactor: validate range tombstone change before it is moved
The current S3 client was tested over minio, and it takes a few more touches to work with Amazon S3.
The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is -- the request _body_ may or may not be included in the signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. -- the request body should be included in the signature calculations), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. -- doesn't include the request body in the signature calculation, only the necessary headers and query parameters), but thus also needs HTTPS.
So what this set does is make the existing S3 client code sign requests. In order to sign the request the code needs to get the AWS key and secret (and region) from somewhere, and this somewhere is the conf/object_storage.yaml config file. The signature generating code was previously merged (moved from alternator code) and updated to suit the S3 client's needs.
In order to properly support HTTPS the PR adds a special connection factory to be used with the seastar http client. The factory does DNS resolving of AWS endpoint names and configures gnutls system trust.
fixes: #13425
Closes #13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT *
FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
```
The reason for the regression is that the call to `do_for_each_sstable` in `collect_all_shared_sstables` to search for sstables that need cleanup caused the list of sstables in the sstable directory to be moved and cleared.
parallel_for_each_restricted moves the container passed to it into a `do_with` continuation. This is required for parallel_for_each_restricted.
However, moving the container is destructive and so, the decision whether to move or not needs to be the caller's, not the callee.
This patch changes the signature of parallel_for_each_restricted to accept a container rather than an rvalue reference, allowing the callers to decide whether to move or not.
Most callers are converted to move the container, except for `do_for_each_sstable` that copies `_unshared_local_sstables`, allowing callers to call `dir.do_for_each_sstable` multiple times without moving the list contents.
Closes#13526
* github.com:scylladb/scylladb:
sstable_directory: coroutinize parallel_for_each_restricted
sstable_directory: parallel_for_each_restricted: use std::ranges for template definition
sstable_directory: parallel_for_each_restricted: do not move container
There are two of them currently, both returning fs::path for sstable
components. One is static and can be dropped, callers are patched to use
the non-static one making the code tiny bit shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::filename() is private and is not supposed to be used as a
path to open any files. However, tests are different and they sometimes
know it is. For that they use test wrapper that has access to private
members and may make assumptions about meaning of sstable::filename().
Said that, the test::filename() should return fs::path, not sstring.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit 1cb95b8cf caused a small regression in the debug printer.
After that commit, range tombstones are printed to stdout,
instead of the target stream.
In practice, this causes range tombstones to appear in test logs
out of order with respect to other parts of the debug message.
Fix that.
Closes#13766
The sstable::write_toc() gets TOC filename from file writer, while it
can get it from itself. This makes the file_writer::get_filename()
private and actually improves logging, as the writer is not required
to have the filename onboard, while sstable always has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All Raft verbs include dst_id, the ID of the destination server, but it isn't checked.
`append_entries` will work even if it arrives at completely the wrong server (but in the same group).
It can cause problems, e.g. in the scenario of replacing a dead node.
This commit adds verification that `dst_id` matches the server's ID; if it
doesn't, the Raft verb is rejected.
Closes#12179
storage_service uses raft_group0, but during shutdown the latter is
destroyed before the former is stopped. This series moves raft_group0
destruction to after storage_service is stopped. For the
move to work, some existing dependencies of raft_group0 are dropped
since they are not really needed during the object creation.
Fixes#13522
In case an sstable unit test case is run individually, it would fail
with an exception saying that the S3_... environment is not set. It's better to
skip the test case rather than fail. If someone wants to run it from a
shell, they will have to prepare an S3 server (minio/AWS public bucket) and
provide the proper environment for the test case.
refs: #13569
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13755
One-way RPC and two-way RPC have different semantics, i.e. with the former the
client doesn't need to wait for an answer.
This commit splits the logic of `handle_raft_rpc` to handle the differences
in semantics, e.g. error handling.
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times, it only guarantees the
same order of events on all replicas.
When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.
To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.
Fixes#13518 (flaky topology test).
Closes#13756
raft_group0 does not really depend on cdc::generation_service, it needs
it only transiently, so pass it to the appropriate methods of raft_group0
instead of during its creation.
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT * FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
E -1
```
The reason for the regression is that the call to `do_for_each_sstable`
in `collect_all_shared_sstables` to search for sstables that need
cleanup caused the list of sstables in the sstable directory to be
moved and cleared.
parallel_for_each_restricted moves the container passed to it
into a `do_with` continuation. This is required for
parallel_for_each_restricted.
However, moving the container is destructive and so,
the decision whether to move or not needs to be the
caller's, not the callee.
This patch changes the signature of parallel_for_each_restricted
to accept an lvalue reference to the container rather than an rvalue reference,
allowing the callers to decide whether to move or not.
Most callers are converted to move the container, as they effectively do
today, and a new method, `filter_sstables`, was added for the
`collect_all_shared_sstables` use case; it allows the `func` that
processes each sstable to decide whether the sstable is kept
in `_unshared_local_sstables` or not.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the validate command uses the logger to output the result of
validation. This is inconsistent with other commands which all write
their output to stdout and log any additional information/errors to
stderr. This patch updates the validate command to do the same. While at
it, remove the "Validating..." message, it is not useful.
We have a more in-depth validator for the mx format, so delegate to that
if the validated sstable is of that format. For kl/la we fall-back to
the reader-level validator we used before.
Working with the low-level sstable parser and index reader, this
validator also cross-checks the index with the data file, making sure
all partitions are located at the position and in the order the index
describes. Furthermore, if the index also has promoted index, the order
and position of clustering elements is checked against it.
This is above the usual fragment kind order, partition key order and
clustering order checks that we already had with the reader-level
validator.
it turns out the only places where we have compiler warnings of -Wparentheses-equality are in the source code generated by ANTLR. strictly speaking, this is valid C++ code, just not quite readable from the hygienic point of view. so let's enable this warning in the source tree, but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang, so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
Closes#13762
* github.com:scylladb/scylladb:
build: only apply -Wno-parentheses-equality to ANTLR generated sources
compaction: disambiguate format_to()
it turns out the only places where we have compiler warnings of
-Wparentheses-equality are in the source code generated by ANTLR. strictly
speaking, this is valid C++ code, just not quite readable from the
hygienic point of view. so let's enable this warning in the source tree,
but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang,
so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should always qualify `format_to` with its namespace. otherwise
we'd have following failure when compiling with libstdc++ from GCC-13:
```
/home/kefu/dev/scylladb/compaction/table_state.hh:65:16: error: call to 'format_to' is ambiguous
return format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
^~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13760
Support the AWS_S3_EXTRA environment variable that's :-split and the
respective substrings are set as endpoint AWS configuration. This makes
it possible to run boost S3 test over real S3.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the endpoint config specifies AWS key, secret and region, all the
S3 requests get signed. Signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes the
x-amz-date, x-amz-content-sha256 and host headers in the signing list.
The content can be unsigned when sent over HTTPS, this is what this
patch does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Existing seastar's factories work on socket_address, but in S3 we have
endpoint name, which is a DNS name in the case of real S3. So this patch
creates the http client for S3 with the custom connection factory that
does two things.
First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).
Since s3 client creation is currently non-waiting code, the above
initialization is spawned in a fiber, and before creating the connection
this fiber is waited upon.
This code probably deserves living in seastar, but for now it can land
next to utils/s3/client.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code temporarily assumes that the endpoint port is 9000.
This is what tests' local minio is started with. This patch keeps the
port number on endpoint config and makes test get the port number from
minio starting code via environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to the previous patch -- extend the s3::client constructor to get
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are yet unused too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the client is constructed with a socket_address which is prepared
by the caller from the endpoint string. That's not flexible enough,
because s3 client needs to know the original endpoint string for two
reasons.
First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.
So this patch just relaxes the client constructor to accept the endpoint
string and hard-codes port 9000. The latter is temporary; this is how the
local tests' minio is started, but the next patch will make it configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch. The sstables::s3_storage gets the
endpoint config instance upon creation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user sstables manager will need to provide endpoint config for
sstables' storage drivers. For that it needs to get it from db::config
and keep it in sync with its updates.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart-uploading is banned for unsigned
requests on the S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so as there's an option to use S3
URLs without a region name in them, which we may want to use some time.
To keep the described configuration the proposed place is the
object_storage.yaml file with the format
endpoints:
- name: a.b.c
port: 443
aws_key: 12345
aws_secret: abcdefghijklmnop
...
When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Allow the validation level to be customized by whoever creates the
compaction state. Add a default value (the previous hardcoded level) to
avoid the churn of updating all call sites.
e2c9cdb576 moved the validation of the
range tombstone change to the place where it is actually consumed, so we
don't attempt to pass purged or discarded range tombstones to the
validator. In doing so however, the validate pass was moved after the
consume call, which moves the range tombstone change, the validator
having been passed a moved-from range tombstone. Fix this by moving the
validation to before the consume call.
Refs: #12575
in this series, we try to use `generation_type` as a proxy to hide the consumers from its underlying type. this paves the road to the UUID based generation identifier. as by then, we cannot assume the type of the `value()` without asking `generation_type` first. better off leaving all the formatting and conversions to the `generation_type`. also, this series changes the "generation" column of sstable registry table to "uuid", and convert the value of it to the original generation_type when necessary, this paves the road to a world with UUID based generation id.
Closes#13652
* github.com:scylladb/scylladb:
db: use uuid for the generation column in sstable registry table
db, sstable: add operator data_value() for generation_type
db, sstable: print generation instead of its value
Currently there are only 2 tests for S3 -- the pure client test and the compound object_store test that launches scylla, creates an s3-backed table and CQL-queries it. At the same time there's a whole lot of small unit tests for sstables functionality, some of which can run over S3 storage too.
This PR adds this support and patches several test cases to use it. More test cases are to come later on demand.
fixes: #13015
Closes #13569
* github.com:scylladb/scylladb:
test: Make resharding test run over s3 too
test: Add lambda to fetch bloom filter size
test: Tune resharding test use of sstable::test_env
test: Make datafile test case run over s3 too
test: Propagate storage options to table_for_test
test: Add support for s3 storage_options in config
test: Outline sstables::test_env::do_with_async()
test: Keep storage options on sstable_test_env config
sstables: Add and call storage::destroy()
sstables: Coroutinize sstable::destroy()
Currently mp_row_consumer_m creates an alias to data_consumer::proceed.
Code in the rest of the file uses both the unqualified name and
mp_row_consumer_m::proceed. Remove the alias and just use
`data_consumer::proceed` directly everywhere, which leads to cleaner code.
Test sstable::validate() instead. Also rename the unit test testing said
method from scrub_validate_mode_validate_reader_test to
sstable_validate_test to reflect the change.
At this point this test should probably be moved to
sstable_datafile_test.cc, but not in this patch.
Sadly this transition means we lose some test scenarios. Since now we
have to write the invalid data to sstables, we have to drop scenarios
which trigger errors on either the write or read path.
To replace the validate code currently in compaction/compaction.cc (not
in this commit). We want to push down this logic to the sstable layer,
so that:
* Non compaction code that wishes to validate sstables (tests, tools)
doesn't have to go through compaction.
* We can abstract how sstables are validated, in particular we want to
add a new more low-level validation method that only the more recent
sstable versions (mx) will support.
Currently said method creates a combined reader from all the sstables
passed to it then validates this combined reader.
Change it to validate each sstable (reader) individually in preparation
of the new validate method which can handle a single sstable at a time.
Note that this is not going to make much impact in practice, all callers
pass a single sstable to this method already.
Currently, error messages for validation errors are produced in several
places:
* the high-level validator (which is built on the low-level one)
* scrub compaction and validation compaction (scrub in validate mode)
* scylla-sstable's validate operation
We plan to introduce yet another place which would use the low-level
validator and hence would have to produce its own error messages. To cut
down all this duplication, centralize the production of error messages
in the low-level validator, which now returns a `validation_result`
object instead of bool from its validate methods. This object can be
converted to bool (so its backwards compatible) and also contains an
error message if validation failed. In the next patches we will migrate
all users of the low level validator (be that direct or indirect) to use
the error messages provided in this result object instead of coming up
with one themselves.
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491
Closes #13563
This mini-series cleans up printing of ranges in utils/to_string.hh
It generalizes the helper function to work on a std::ranges::range,
with some exceptions, and adds a helper for boost::transformed_range.
It also changes the internal interface by moving `join` to the utils namespace
and using std::string rather than seastar::sstring.
Additional unit tests were added to test/boost/json_test
Fixes #13146
Closes #13159
* github.com:scylladb/scylladb:
utils: to_string: get rid of utils::join
utils: to_string: get rid of to_string(std::initializer_list)
utils: to_string: get rid of to_string(const Range&)
utils: to_string: generalize range helpers
test: add string_format_test
utils: chunked_vector: add std::ranges::range ctor
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a huge number, huge allocations and
stalls. This is highly undesirable.
This series adds more tests in this area covering additional corner cases,
and then fixes the issue by adding the missing verification where it's
needed. After the series, all 12 tests in test/alternator/test_number.py now pass.
Fixes #6794
Closes #13743
* github.com:scylladb/scylladb:
alternator: unit test for number magnitude and precision function
alternator: add validation of numbers' magnitude and precision
test/alternator: more tests for limits on number precision and magnitude
test/alternator: reproducer for DoS in unlimited-precision addition
* change the "generation" column of sstable registry table from
bigint to uuid
* add a helper to convert a UUID back to the original generation
in the long run, we encourage users to use the uuid based generation
identifier. but in the transition period, both bigint based and uuid
based identifiers are used for the generation. so to cater to both
needs, we use a hackish way to store the integer in a UUID. to
differentiate the was-integer UUID from a genuine UUID, we
check the UUID's most_significant_bits. because we only support
serializing UUID v1, if the timestamp in the UUID is zero,
we assume the UUID was generated from an integer when converting it
back to a generation identifier.
also, please note, the only use case of using generation as a
column is the sstable_registry table, but since its schema is fixed,
we cannot store both a bigint and a UUID as the value of its
`generation` column, the simpler way forward is to use a single type
for the generation. to be more efficient and to preserve the type of
the generation, instead of using types like ascii string or bytes,
we will always store the generation as a UUID in this table, if the
generation's identifier is an int64_t, the value of the integer will
be used as the least significant bits of the UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cql-pytest
framework.
This test file checks various LWT conditional updates which involve
collections or UDTs (there is a separate test file for LWT conditional
updates which do not involve collections, which I haven't translated
yet).
The tests reproduce one known bug:
Refs #5855: lwt: comparing NULL collection with empty value in IF
condition yields incorrect results
And also uncovered three previously-unknown bugs:
Refs #13586: Add support for CONTAINS and CONTAINS KEY in LWT expressions
Refs #13624: Add support for UDT subfields in LWT expression
Refs #13657: Misformatted printout of column name in LWT error message
Beyond those bona-fide bugs, this test also demonstrates several places
where we intentionally deviated from Cassandra's behavior, forcing me
to comment out several checks. These deviations are known, and intentional,
but some of them are undocumented and it's worth listing here the ones
re-discovered by this test:
1. On a successful conditional write, Cassandra returns just True, Scylla
also returns the old contents of the row. This difference is officially
documented in docs/kb/lwt-differences.rst.
2. Scylla allows the test "l = [null]" or "s = {null}" with this weird
null element (the result is false), whereas Cassandra prints an error.
3. Scylla allows "l[null]" or "m[null]" (resulting in null), Cassandra
prints an error.
4. Scylla allows a negative list index, "l[-2]", resulting in null.
Cassandra prints an error in this case.
5. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
6. Scylla allows "IN null" (the condition just fails), Cassandra prints
an error in this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13663
Now that the test case and the lib/utils code it uses follow a storage-agnostic
approach, it can be extended to run over S3 storage as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The resharding test compares bloom filter sizes before and after reshard
runs. For that it gets the filter on-disk filename and stat()s it. That
won't work with S3 as it doesn't have accessible on-disk files.
Some time ago there existed the storage::get_stats() method, but now
it's gone. The new s3::client::get_object_stat() is coming, but it will
take time to switch to it. For now, generalize filter size fetching into
a local lambda. Next patch will make a stub in it for S3 case, and once
the get_object_stat() is there we'll be able to smoothly start using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
The test case in question spawns an async context then makes the test_env
instance on the stack (and a stopper for it too). There are helpers for the
above steps; better to use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the sstable_datafile test cases are capable of running with S3
storage, so this patch makes the simplest of them do it. Patching the
rest of this file is optional, because the cases mostly test how the
datafile data manipulations work without checking file manipulations.
So even though making them all run over S3 is possible, it would just
increase the testing time without really testing the storage driver.
So this patch makes one test case run over local and S3 storages, more
patches to update more test cases with files manipulations are yet to
come.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach table_for_tests to use any storage options, not just the local one. For
now the only user that passes non-local options is sstables::test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable test case wants to run over S3 storage it needs to
specify that in test config by providing the S3 storage options. So
first thing this patch adds is the helper that makes these options based
on the env left by minio launcher from test.py.
Next, in order to make sstables_manager work with S3 it needs the
plugged system keyspace which, in turn, needs query processor, proxy,
database, etc. All this stuff lives in cql_test_env, so the test case
running with S3 options will run in a sstables::test_env nested inside
cql_test_env. The latter would also need to plug its system keyspace to
the former's sstables manager and turn the experimental feature ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We have known for a long time (see issue #1703) that the quality of our
CQL "syntax error" messages leaves a lot to be desired, especially when
compared to Cassandra. This patch doesn't yet bring us great error
messages with great context - doing this isn't easy and it appears that
Antlr3's C++ runtime isn't as good as the Java one in this regard -
but this patch at least fixes **garbage** printed in some error messages.
Specifically, when the parser can deduce that a specific token is missing,
it used to print
line 1:83 missing ')' at '<missing '
After this patch we get rid of the meaningless string '<missing ':
line 1:83 : Missing ')'
Also, when the parser deduced that a specific token was unneeded, it
used to print:
line 1:83 extraneous input ')' expecting <invalid>
Now we got rid of this silly "<invalid>" and write just:
line 1:83 : Unexpected ')'
Refs #1703. I haven't yet marked that issue "fixed" because I think a
complete fix would also require printing the entire misparsed line and the
point of the parse failure. Scylla still prints a generic "Syntax Error"
in most cases now, and although the character number (83 in the above
example) can help, it's much more useful to see the actual failed
statement and where character 83 is.
Unfortunately some tests enshrine buggy error messages and had to be
fixed. Other tests enshrined strange text for a generic unexplained
error message, which used to say " : syntax error..." (note the two
spaces and ellipses) and after this patch is " : Syntax error". So
these tests are changed. Another message, "no viable alternative at
input" is deliberately kept unchanged by this patch so as not to break
many more tests which enshrined this message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13731
So that it could be set to s3 by the test case on demand. Default is
local storage which uses env's tempdir or explicit path argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The s3_storage leaks the client when the sstable gets destroyed. So far this
went unnoticed, but a debug-mode unit test run over minio caught it. So
here's the fix.
When an sstable is destroyed it also kicks the storage to do whatever
cleanup is needed. In the case of s3 storage the cleanup consists of closing
the on-boarded client. Until #13458 is fixed each sstable has its own
private version of the client and there's no other place where it can be
close()d in a co_await-able manner.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the previous patch we added a limit in Alternator for the magnitude
and precision of numbers, based on a function get_magnitude_and_precision
whose implementation was, unfortunately, rather elaborate and delicate.
Although we did add in the previous patches some end-to-end tests which
confirmed that the final decision made based on this function, to accept or
reject numbers, was a correct decision in a few cases, such an elaborate
function deserves a separate unit test for checking just that function
in isolation. In fact, this unit test uncovered some bugs in the first
implementation of get_magnitude_and_precision() which the other tests
missed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a huge number, huge allocations and
stalls. This is highly undesirable.
After this patch, all tests in test/alternator/test_number.py now
pass. The various failing tests which verify magnitude and precision
limitations in different places (key attributes, non-key attributes,
and arithmetic expressions) now pass - so their "xfail" tags are removed.
Fixes #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
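The limits described above can be sketched in isolation. This is an illustrative reimplementation, not Scylla's actual get_magnitude_and_precision(); the names magnitude_and_precision and alternator_accepts are invented for the example, and signs/e-notation are deliberately ignored:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>

// Sketch only: compute the decimal exponent of the leading digit (magnitude)
// and the number of significant digits (precision) of a plain decimal string.
std::pair<int, int> magnitude_and_precision(const std::string& num) {
    std::string digits;
    int frac = 0;
    bool seen_point = false;
    for (char c : num) {
        if (c == '.') {
            seen_point = true;
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            digits.push_back(c);
            if (seen_point) {
                ++frac; // count fractional digits
            }
        }
    }
    size_t first = digits.find_first_not_of('0');
    if (first == std::string::npos) {
        return {0, 0}; // the number is zero
    }
    digits.erase(0, first); // drop leading zeros; they are not significant
    int magnitude = static_cast<int>(digits.size()) - 1 - frac;
    int precision = static_cast<int>(digits.find_last_not_of('0')) + 1;
    return {magnitude, precision};
}

// DynamoDB's documented limits: leading-digit exponent between -130 and 125,
// at most 38 significant digits.
bool alternator_accepts(int magnitude, int precision) {
    return magnitude >= -130 && magnitude <= 125 && precision <= 38;
}
```

For example, "123.45" has magnitude 2 and precision 5, while "0.001" has magnitude -3 and precision 1.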
We already have xfailing tests for issue #6794 - the missing checks on
precision and magnitudes of numbers in Alternator - but this patch adds
checks for additional corner cases. In particular we check the case that
numbers are used in a *key* column, which goes to a different code path
than numbers used in non-key columns, so it's worth testing as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As already noted in issue #6794, whereas DynamoDB limits the magnitude
of numbers to between 10^-130 and 10^125, Scylla does not. In this patch
we add yet another test for this problem, but unlike previous tests
which just showed too much magnitude being allowed, which always sounded
like a benign problem, the test in this patch shows that this "feature"
can be used to DoS Scylla: a user can send a short request that
causes arbitrarily-large allocations, stalls and CPU usage.
The test is currently marked "skip" because it can cause Scylla to
take a very long time and/or run out of memory. It passes on DynamoDB
because the excessive magnitude is simply not allowed there.
Refs #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It's unused.
Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).
Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.
This doesn't break anything since to_string(initializer_list)
wasn't used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kinds of containers.
Modernize the implementation using templates based
on std::ranges::range and using fmt::join.
Extend the unit test to cover formatting different types of ranges,
boost::transformed ranges, and deque.
Fixes #13146
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when altering permissions on a functions resource, we
only check if it's a builtin function and not if it's all functions
in the "system" keyspace, which contains all builtin functions.
This patch adds a check of whether the function resource keyspace
is "system". This check actually covers both "single function"
and "all functions in keyspace" cases, so the additional check
for single functions is removed.
Closes #13596
In many cases we trigger offstrategy compaction opportunistically
even when there's nothing to do. In this case we still print
lots of info-level messages to the log and call
`run_offstrategy_compaction`, which wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message at debug
level instead.
Fixes #13466
Also, add a group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
Fixes #13467
Closes #13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
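The early-bail shape described above could look roughly like this; a minimal sketch with invented names (maybe_run_offstrategy and the string log stand in for the real compaction_manager code and logger):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: skip the expensive off-strategy path when there is nothing to do,
// logging at debug level instead of info. maintenance_set stands in for the
// set of sstables awaiting integration into the main set.
bool maybe_run_offstrategy(const std::vector<std::string>& maintenance_set,
                           std::string& log_line) {
    if (maintenance_set.empty()) {
        log_line = "Skipping off-strategy compaction: no candidates";
        return false; // bail out early, no cpu cycles wasted
    }
    log_line = "Starting off-strategy compaction";
    return true;
}
```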
This reverts commit d85af3dca4. It
restores the linear search algorithm, as we expect the search to
terminate near the origin. In this case linear search is O(1)
while binary search is O(log n).
A comment is added so we don't repeat the mistake.
Closes #13704
No point in going through the vector<mutation> entry-point
just to discover in run time that it was called
with a single-element vector, when we know that
in advance.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13733
this is a part of a series to migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `cdc::generation_id` and `db_clock::time_point` without the help of `operator<<`.
the formatter of `cdc::generation_id` uses that of `db_clock::time_point` , so these two commits are posted together in a single pull request.
the corresponding `operator<<()` is removed in this change, as all its callers are now using fmtlib for formatting.
Refs #13245
Closes #13703
* github.com:scylladb/scylladb:
db_clock: specialize fmt::formatter<db_clock::time_point>
cdc: generation: specialize fmt::formatter<generation_id>
This series fixes a few issues caused by f1bbf705f9
(f1bbf705f9):
- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
- Fixes #13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
- allow the filtering reader now used during resharding to process tokens owned by other shards
Closes #13635
* github.com:scylladb/scylladb:
compaction: make_partition_filter: do not assert shard ownership
distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
table, compaction_manager: prevent cross shard access to owned_ranges_ptr
this series silences the warnings from GCC 13. some of these changes are considered critical fixes, and are posted separately.
see also #13243
Closes #13723
* github.com:scylladb/scylladb:
cdc: initialize an optional using its value type
compaction: disambiguate type name
db: schema_tables: drop unused variable
reader_concurrency_semaphore: fix signed/unsigned comparision
locator: topology: disambiguate type names
raft: disambiguate promise name in raft::awaited_conf_changes
when compiling using GCC-13, it warns that:
```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
224 | sstring parse_multipart_upload_id(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
```
so it turns out that `rapidxml::xml_document<>` can be very large;
let's allocate it on the heap instead of on the stack to address this issue.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13722
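The fix pattern is simply moving the large object behind a unique_ptr. big_parser below is an invented stand-in for rapidxml::xml_document<>, which embeds a large internal pool, and the function body is a placeholder for the real XML walk:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for rapidxml::xml_document<>: a type whose sizeof() is tens of KiB.
struct big_parser {
    char pool[64 * 1024];
};

std::string parse_multipart_upload_id(const std::string& body) {
    // Heap-allocate the parser instead of putting ~64 KiB on the stack,
    // which keeps the frame small and satisfies -Wstack-usage.
    auto doc = std::make_unique<big_parser>();
    (void)body; // sketch: the real code parses body and walks the XML tree
    return "upload-id";
}
```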
`data_dictionary::database::find_keyspace()` returns a temporary
object, and `data_dictionary::keyspace::user_types()` returns a
reference pointing to a member of this temporary object, so we
cannot use the reference after the expression is evaluated. in
this change, we capture the return value of `find_keyspace()` using
a universal reference, and keep the return value of `user_types()`
in a reference, to ensure that we can use it later.
this change silences the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/statements/authorization_statement.cc:68:21: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
68 | const auto& utm = qp.db().find_keyspace(*keyspace).user_types();
| ^~~
```
Fixes #13725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13726
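The general shape of the fix, in a stand-alone form; keyspace_view, user_types_holder and find_keyspace are invented stand-ins for the real data_dictionary types:

```cpp
#include <cassert>
#include <string>

struct user_types_holder {
    std::string name = "ut";
};

// Stand-in for the temporary object returned by find_keyspace().
struct keyspace_view {
    user_types_holder uts;
    const user_types_holder& user_types() const { return uts; }
};

keyspace_view find_keyspace() { return {}; }

std::string use_user_types() {
    // WRONG: const auto& utm = find_keyspace().user_types();
    // the keyspace_view temporary dies at the end of the full expression,
    // leaving utm dangling.
    auto&& ks = find_keyspace();       // binding the owner extends its lifetime
    const auto& utm = ks.user_types(); // reference into a still-alive object
    return utm.name;                   // safe: ks is alive until scope exit
}
```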
`current_scheduling_group()` returns a temporary value, and `name()`
returns a reference, so we cannot capture the return value by reference,
and use the reference after this expression is evaluated. this would
cause undefined behavior. so let's just capture it by value.
this change also silences the following warning from GCC-13:
```
/home/kefu/dev/scylladb/transport/server.cc:204:11: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
204 | auto& cur_sg_name = current_scheduling_group().name();
| ^~~~~~~~~~~
/home/kefu/dev/scylladb/transport/server.cc:204:56: note: the temporary was destroyed at the end of the full expression ‘seastar::current_scheduling_group().seastar::scheduling_group::name()’
204 | auto& cur_sg_name = current_scheduling_group().name();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes #13719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13724
`data_dictionary::database::find_column_family()` returns a temporary value,
and `data_dictionary::table::get_index_manager()` returns a reference into
this temporary value, so we cannot capture this reference and use it after
the expression is evaluated. in this change, we keep the return value
of `find_column_family()` by value, to extend the lifetime of the return
value of `get_index_manager()`.
this should address the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/restrictions/statement_restrictions.cc:519:15: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
519 | auto& sim = db.find_column_family(_schema).get_index_manager();
| ^~~
```
Fixes #13727
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13728
Let's remove `expr::token` and replace all of its functionality with `expr::function_call`.
`expr::token` is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`, this will be internally represented as an expression which uses `expr::token` to represent the `token(p1, p2)` part.
The situation with `expr::token` is a bit complicated.
On one hand it's supposed to represent the partition token, but sometimes it's also assumed that it can represent a generic call to the `token()` function, for example `token(1, 2, 3)` could be a `function_call`, but it could also be `expr::token`.
The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this allowing `token(1, 2, 3)` to be represented as `expr::token` is dangerous - the query planning might think that it is `token(p1, p2, p3)` and plan the query based on this, which would be wrong.
Currently `expr::token` is created only in one specific case.
When the parser detects that the user typed in a restriction which has a call to `token` on the LHS it generates `expr::token`.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token, it stays a `function_call`. During preparation there is no check to see if a `function_call` to `token` could be turned into `expr::token`. This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented as `expr::token` and the query planner handles that, but sometimes it might be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of code duplication between a `function_call` and `expr::token`.
All of the evaluation and preparation is the same for `expr::token` as it's for a `function_call` to the token function.
Currently it's impossible to evaluate `expr::token` and preparation has some flaws, but implementing it would basically consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries.
With `expr::token` we turn a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple tables? The schema is different, so something that is represented as `expr::token` for one schema would be represented as `function_call` in the context of a different schema.
Translating expressions to different tables would require careful manipulation to convert `expr::token` to `function_call` and vice versa. This could cause trouble for index queries.
Overall I think it would be best to remove `expr::token`.
Although having a clear marker for the partition token is sometimes nice for query planning, in my opinion the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things, having two separate representations of the same thing without clear boundaries between them causes trouble.
Instead of having both `expr::token` and `function_call` we can just have the `function_call` and check if it represents a partition token when needed.
Refs: #12906
Refs: #12677
Closes: #12905
Closes #13480
* github.com:scylladb/scylladb:
cql3: remove expr::token
cql3: keep a schema in visitor for extract_clustering_prefix_restrictions
cql3: keep a schema inside the visitor for extract_partition_range
cql3/prepare_expr: make get_lhs_receiver handle any function_call
cql3/expr: properly print token function_call
expr_test: use unresolved_identifier when creating token
cql3/expr: split possible_lhs_values into column and token variants
cql3/expr: fix error message in possible_lhs_values
cql3: expr: reimplement is_satisfied_by() in terms of evaluate()
cql3/expr: add a schema argument to expr::replace_token
cql3/expr: add a comment for expr::has_partition_token
cql3/expr: add a schema argument to expr::has_token
cql3: use statement_restrictions::has_token_restrictions() wherever possible
cql3/expr: add expr::is_partition_token_for_schema
cql3/expr: add expr::is_token_function
cql3/expr: implement preparing function_call without a receiver
cql3/functions: make column family argument optional in functions::get
cql3/expr: make it possible to prepare expr::constant
cql3/expr: implement test_assignment for column_value
cql3/expr: implement test_assignment for expr::constant
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed into `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.
This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.
Closes #13690
* github.com:scylladb/scylladb:
raft topology: rename `update_replica_state` -> `update_topology_state`
raft topology: remove `transition_state::normal`
raft topology: switch on `transition_state` first
raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
raft topology: parse replaced node in `exec_global_command`
raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
storage_service: extract raft topology coordinator fiber to separate class
raft topology: rename `replication_state` to `transition_state`
raft topology: make `replication_state` a topology-global state
as this syntax is not supported by the standard, it seems clang
just silently constructs the value with the initializer list and
calls the operator=, but GCC complains:
```
/home/kefu/dev/scylladb/cdc/split.cc:392:54: error: converting to ‘std::optional<partition_deletion>’ from initializer list would use explicit constructor ‘constexpr std::optional<_Tp>::optional(_Up&&) [with _Up = const tombstone&; typename std::enable_if<__and_v<std::__not_<std::is_same<std::optional<_Tp>, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::__not_<std::is_same<std::in_place_t, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::is_constructible<_Tp, _Up>, std::__not_<std::is_convertible<_Iter, _Iterator> > >, bool>::type <anonymous> = false; _Tp = partition_deletion]’
392 | _result[t.timestamp].partition_deletions = {t};
| ^
```
to silence the error, and to be more standard-compliant,
let's use emplace() instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
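A minimal reproduction of the issue and the fix; tombstone and partition_deletion here are simplified stand-ins for the real CDC types:

```cpp
#include <cassert>
#include <optional>

struct tombstone {
    long timestamp;
};

struct partition_deletion {
    // explicit constructor: copy-list-initialization such as `opt = {t}`
    // is ill-formed, which is exactly what GCC rejects.
    explicit partition_deletion(const tombstone& t) : timestamp(t.timestamp) {}
    long timestamp;
};

long record_deletion(const tombstone& t) {
    std::optional<partition_deletion> pd;
    // pd = {t};       // rejected: would need the explicit constructor
    pd.emplace(t);     // direct-initializes in place, standard-compliant
    return pd->timestamp;
}
```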
This commit removes expr::token and replaces all of its functionality with expr::function_call; the full rationale is given in the PR description above.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
get_lhs_receiver looks at the prepared LHS of a binary operator
and creates a receiver corresponding to this LHS expression.
This receiver is later used to prepare the RHS of the binary operator.
It's able to handle a few expression types - the ones that are currently
allowed to be on the LHS.
One of those types is `expr::token`, to handle restrictions like `token(p1, p2) = 3`.
Soon token will be replaced by `expr::function_call`, so the function will need
to handle `function_calls` to the token function.
Although we expect there to be only calls to the `token()` function,
as other functions are not allowed on the LHS, it can be made generic
over all function calls, which will help in future grammar extensions.
The function calls that it can currently get are calls to the token function,
but they're not validated yet, so it could also be something like `token(pk, pk, ck)`.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Printing for function_call is a bit strange.
When printing an unprepared function it prints
the name and then the arguments.
For a prepared function it prints <anonymous function>
as the name and then the arguments.
Prepared functions have a name() method, but printing
doesn't use it; maybe not all functions have a valid name(?).
The token() function will soon be represented as a function_call
and it should be printable in a user-readable way.
Let's add an if which prints `token(arg1, arg2)`
instead of `<anonymous function>(arg1, arg2)` when printing
a call to the token function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
One test for expr::token uses a raw column identifier
in the test.
Let's change it to unresolved_identifier, which is
a standard representation of unresolved column
names in expressions.
Once expr::token is removed it will be possible
to create a function_call with unresolved_identifiers
as arguments.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The possible_lhs_values function takes an expression and a column
and finds all possible values for the column that make
the expression true.
Apart from finding column values it's also capable of finding
all matching values for the partition key token.
When a nullptr column is passed, possible_lhs_values switches
into token values mode and finds all values for the token.
This interface isn't ideal.
It's confusing to pass a nullptr column when one wants to
find values for the token. It would be better to have a flag,
or just have a separate function.
Additionally in the future expr::token will be removed
and we will use expr::is_partition_token_for_schema
to find all occurrences of the partition token.
expr::is_partition_token_for_schema takes a schema
as an argument, which possible_lhs_values doesn't have,
so it would have to be extended to get the schema from
somewhere.
To fix these two problems let's split possible_lhs_values
into two functions - one that finds possible values for a column,
which doesn't require a schema, and one that finds possible values
for the partition token and requires a schema:
value_set possible_column_values(const column_definition* col, const expression& e, const query_options& options);
value_set possible_partition_token_values(const expression& e, const query_options& options, const schema& table_schema);
This will make the interface cleaner and enable smooth transition
once expr::token is removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In possible_lhs_values there was an error message talking
about is_satisfied_by. It looks like a badly
copy-pasted message.
Change it to say possible_lhs_values, as it should.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Just like has_token, replace_token will use
expr::is_partition_token_for_schema to find all instances
of the partition token to replace.
Let's prepare for this change by adding a schema argument
to the function before making the big change.
It's unused at the moment, but having a separate commit
should make it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In the future expr::token will be removed and checking
whether there is a partition token inside an expression
will be done using expr::is_partition_token_for_schema.
This function takes a schema as an argument,
so all functions that will call it also need
to get the schema from somewhere.
Right now it's an unused argument, but in the future
it will be used. Adding it in a separate commit
makes it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The statement_restrictions class has a method called has_token_restriction().
This method checks whether the partition key restrictions contain expr::token.
Let's use this function in all applicable places instead of manually calling has_token().
In the future has_token() will have an additional schema argument,
so eliminating calls to has_token() will simplify the transition.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function to check whether the expression
represents a partition token - that is a call
to the token function with consecutive partition
key columns as the arguments.
For example for `token(p1, p2, p3)` this function
would return `true`, but for `token(1, 2, 3)` or `token(p3, p2, p1)`
the result would be `false`.
The function has a schema argument because a schema is required
to get the list of partition columns that should be passed as
arguments to token().
Maybe it would be possible to infer the schema from the information
given earlier during prepare_expression, but it would be complicated
and a bit dangerous to do this. Sometimes we operate on multiple tables
and the schema is needed to differentiate between them - a token() call
can represent the base table's partition token, but for an index table
this is just a normal function call, not the partition token.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
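Conceptually the check reduces to comparing the call's arguments against the schema's partition key columns in declaration order. A toy sketch with invented types (the real function operates on expr::expression and a schema, not strings):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: a token() call is "the partition token" only when its arguments
// are exactly the schema's partition key columns, in declaration order.
bool is_partition_token_for_schema(const std::vector<std::string>& token_args,
                                   const std::vector<std::string>& pk_columns) {
    return token_args == pk_columns;
}
```

For a schema with partition key (p1, p2, p3), token(p1, p2, p3) matches, while token(1, 2, 3) or token(p3, p2, p1) does not.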
Add a function that can be used to check
whether a given expression represents a call
to the token() function.
Note that a call to token() doesn't mean
that the expression represents a partition
token - it could be something like token(1, 2, 3),
just a normal function_call.
The code for checking has been taken from functions::get.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently trying to do prepare_expression(function_call)
with a nullptr receiver fails.
It should be possible to prepare function calls without
a known receiver.
When the user types in: `token(1, 2, 3)`
the code should be able to figure out that
they are looking for a function with name `token`,
which takes 3 integers as arguments.
In order to support that we need to prepare
all arguments that can be prepared before
attempting to find a function.
Prepared expressions have a known type,
which helps to find the right function
for the given arguments.
Additionally the current code for finding
a function requires all arguments to be
assignment_testable, which requires preparing
some expression types, e.g. column_values.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The method `functions::get` is used to get the `functions::function` object
of the CQL function called using `expr::function_call`.
Until now `functions::get` required the caller to pass both the keyspace
and the column family.
The keyspace argument is always needed, as every CQL function belongs
to some keyspace, but the column family isn't used in most cases.
The only case where having the column family is really required
is the `token()` function. Each variant of the `token()` function
belongs to some table, as the arguments to the function are the
consecutive partition key columns.
Let's make the column family argument optional. In most cases
the function will work without information about column family.
In the case of the `token()` function there will be a check
and it will throw an exception if the argument is nullopt.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this also silences the warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable]
1489 | auto ts = db_clock::now();
| ^~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
a signed/unsigned comparison can overflow, and GCC-13 rightly points
this out. so let's use `std::cmp_greater_equal()` when comparing
unsigned and signed values for greater-or-equal.
```
/home/kefu/dev/scylladb/reader_concurrency_semaphore.cc:931:76: error: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
931 | if (_resources.memory <= 0 && (consumed_resources().memory + r.memory) >= get_kill_limit()) [[unlikely]] {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
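A minimal analog of the conflict, using a local stand-in for `seastar::promise` (the real type is not reproduced here): naming a member after its own type makes the name refer to the member from that point on inside the class, which GCC 13 flags; renaming the member sidesteps it:

```cpp
#include <cassert>

template <typename T = void>
struct promise {};  // stand-in for seastar::promise

struct awaited_index {
    // `promise<> promise;` would make the name `promise` mean the member
    // inside this class, triggering GCC 13's -Wchanges-meaning.
    promise<> pr;  // renamed member keeps `promise` meaning the type
    promise<> make_another() { return promise<>{}; }  // type name still usable
};
```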
this change silences the warning of `-Wmissing-braces` from
clang. in general, we can initialize an object without a constructor
with braces; this is called aggregate initialization.
the standard does allow us to initialize each element using
either copy-initialization or direct-initialization, but in our case
neither of them applies, so clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change, braces are added around the subobject initialization.
also, take the opportunity to use structured bindings to simplify the
related code.
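The classic shape of this warning, reduced to a self-contained example (the types are illustrative, not the actual `options.elements` types):

```cpp
#include <cassert>

struct kv {
    int key;
    int value;
};

struct option {
    kv element;
};

// `option{1, 2}` compiles (brace elision is legal), but clang's
// -Wmissing-braces asks for explicit braces around the kv subobject:
inline option make_option() {
    return option{{1, 2}};  // inner braces initialize the kv subobject
}
```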
Closes #13705
* github.com:scylladb/scylladb:
build: reenable -Wmissing-braces
treewide: add braces around subobject
cql3/stats: use zero-initialization
S3 wasn't providing filter size and accurate size for all SSTable components on disk.
First, filter size is provided by taking advantage of the fact that its in-memory representation is roughly the same as the on-disk one.
Second, the size of all components is provided by piggybacking on the sstable parser and writer, so there is no longer a need for a separate additional step after Scylla has either parsed or written all components.
Finally, sstable::storage::get_stats() is killed, so the burden is no longer pushed on the storage type implementation.
Refs #13649.
Closes #13682
* github.com:scylladb/scylladb:
test: Verify correctness of sstable::bytes_on_disk()
sstable: Piggyback on sstable parser and writer to provide bytes_on_disk
sstable: restore indentation in read_digest() and read_checksum()
sstable: make all parsing of simple components go through do_read_simple()
sstable: Add missing pragma once to random_access_reader.hh
sstable: make all writing of simple components go through do_write_simple()
test: sstable_utils: reuse set_values()
sstable: Restore indentation in read_simple()
sstable: Coroutinize read_simple()
sstable: Use filter memory footprint in filter_size()
so we can apply `execute_cql()` on `generation_type` directly without
extracting its value using `generation.value()`. this paves the road to
adding a UUID-based generation id to `generation_type`: by then we
will have both UUID-based and integer-based `generation_type`s, so
`generation_type::value()` will no longer be able to represent the
value, and this method will be replaced by `operator data_value()` in
this use case.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change prepares for the change to use `variant<UUID, int64_t>`
as the value of `generation_type`. after that change, the "value"
of a generation will be a UUID or an integer, and we don't want to
expose the variant in generation's public interface, so the `value()`
method will be changed or removed by then.
this change takes advantage of the fact that the formatter of
`generation_type` always prints its value. also, it's better to
reuse `generation_type` formatter when appropriate.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
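The motivation can be sketched with a standalone snippet; this is illustrative only (a `std::string` stands in for the UUID type, and the names are invented, not the real `generation_type` API): once the value is a variant, one concrete-typed `value()` accessor no longer works, and formatting goes through a visitor instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>

// hypothetical: variant<UUID, int64_t>, with string standing in for UUID
using generation_value = std::variant<std::string, int64_t>;

inline std::string format_generation(const generation_value& v) {
    return std::visit([](const auto& x) -> std::string {
        if constexpr (std::is_same_v<std::decay_t<decltype(x)>, int64_t>) {
            return std::to_string(x);  // integer-based generation
        } else {
            return x;  // UUID-based generation
        }
    }, v);
}
```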
try_prepare_expression(constant) used to throw an error
when trying to prepare expr::constant.
It would be useful to be able to do this
and it's not hard to implement.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Make it possible to do test_assignment for column_values.
It's implemented using the generic expression assignment
testing function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
test_assignment checks whether a value of some type
can be assigned to a value of a different type.
There is no implementation of test_assignment
for expr::constant, but I would like to have one.
Currently there is a custom implementation
of test_assignment for each type of expression,
but generally each of them boils down to checking:
```
type1->is_value_compatible_with(type2)
```
Instead of implementing another type-specific function
I added expression_test_assignment and used it to
implement test_assignment for constant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this change helps to silence the warning of `-Wmissing-braces` from
clang. in general, we can initialize an object without a constructor
with braces; this is called aggregate initialization.
the standard does allow us to initialize each element using
either copy-initialization or direct-initialization, but in our case
neither of them applies, so clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change, braces are added around the subobject initialization.
also, take the opportunity to use structured bindings to simplify the
related code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
use {} instead of {0ul} for zero initialization. as `_query_cnt`
is a multi-dimensional array, each element in `_query_cnt` is yet
another array, so we cannot initialize it with a `{0ul}`. but
to zero-initialize this array, we can just use `{}`, as per
https://en.cppreference.com/w/cpp/language/zero_initialization
> If T is array type, each element is zero-initialized.
so this should recursively zero-initialize all arrays in `_query_cnt`.
this change should silence the following warning:
```
stats.hh:88:60: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
  [statements::statement_type::MAX_VALUE + 1] = {0ul};
                                                 ^~~
                                                 { }
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
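The rule can be checked with a small self-contained snippet (the array sizes and struct name here are arbitrary, not the real `_query_cnt` dimensions):

```cpp
#include <cassert>
#include <cstdint>

// `= {}` value-initializes the aggregate; for arrays this recursively
// zero-initializes every element, including nested arrays.
struct stats {
    uint64_t query_cnt[3][4] = {};
};

inline uint64_t sum(const stats& s) {
    uint64_t total = 0;
    for (const auto& row : s.query_cnt) {
        for (uint64_t v : row) {
            total += v;
        }
    }
    return total;
}
```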
Closes #13691
* github.com:scylladb/scylladb:
adding documentation for integration with MindsDB
adding documentation for integration with MindsDB
this series syncs the CMake building system with `configure.py`, which was updated for introducing the tablets feature. also, this series includes a couple of cleanups.
Closes #13699
* github.com:scylladb/scylladb:
build: cmake: remove dead code
build: move test-perf down to test/perf
build: cmake: pick up tablets related changes
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `db_clock::time_point` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `generation_id` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loading cores from Scylla executables installed in a non-standard
location can cause gdb to fail reading required libraries.
This is an example of a warning I've got after trying to load core
generated by dtest jenkins job (using ./scripts/open-coredump.sh):
> warning: Can't open file /jenkins/workspace/scylla-master/dtest-daily-debug/scylla/.ccm/scylla-repository/0d64f327e1af9bcbb711ee217eda6df16e517c42/libreloc/libboost_system.so.1.78.0 during file-backed mapping note processing
Invocations of `scylla threads` command ended with an error:
> (gdb) scylla threads
> Python Exception <class 'gdb.error'>: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
> Error occurred in Python: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
An easy fix for this is to set solib-search-path to
/opt/scylladb/libreloc/.
This commit adds that set command to the suggested gdb command line
arguments. I guess it's a good idea to always suggest setting
solib-search-path to that path, as it can save other people from wasting
their time figuring out why coredump opening does not work.
Closes #13696
the removed CMake script was designed to cater to the case where
Seastar's CMake script is not included in the parent project, but
this part was never tested and is dysfunctional, as the `target_sources()`
call misses the target parameter. we can add it back when it is actually needed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We may catch exceptions that are not `marshal_exception`.
Print std::current_exception() in this case to provide
some context about the marshalling error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13693
Consider
- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2
```
auto host_id = _gossiper.get_host_id(endpoint);
auto existing = tmptr->get_endpoint_for_host_id(host_id);
```
host_id = new host id
existing = empty
As a result, del_replacing_endpoint() will not be called.
This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.
This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
the replacing node uses a different host id than the node being replaced.
To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3
After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3
Tests: https://github.com/scylladb/scylla-dtest/pull/3126
Closes #13677
bytes_on_disk is the sum of all sstable components.
As read_simple() fetches the file size before parsing the component,
bytes_on_disk can be added incrementally rather than an additional
step after all components were already parsed.
Likewise, write_simple() tracks the offset for each new component,
and therefore bytes_on_disk can also be added incrementally.
This simplifies s3's life, as it no longer has to care about feeding
bytes_on_disk, which is currently limited to data and index
sizes only.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all parsing of simple components going through do_read_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all writing of simple components going through do_write_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For S3, filter size is currently set to zero, as we want to avoid
"fstat-ing" each file.
On-disk representation of bloom filter is similar to the in-memory
one, therefore let's use memory footprint in filter_size().
User of filter_size() is API implementing "nodetool cfstats" and
it cares about the size of bloom filter data (that's how it's
described).
This way, we provide the filter data size regardless of the
underlying storage type.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty or not. The change was made in order to maintain consistency with
the codebase and to make the empty check complete. This change has no
impact on other parts of the codebase.
Closes #13656
The new name is more generic and appropriate for topology transitions
which don't affect any specific replica but the entire cluster as a
whole (which we'll introduce later).
Also take `guard` directly instead of `node_to_work_on` in this more
generic function. Since we want `node_to_work_on` to die when we steal
its guard, introduce `take_guard` which takes ownership of the object
and returns the guard.
Previously the code assumed that there was always a 'node to work on' (a
node which wants to change its state) or there was no work to do at all.
It would find such a node, switch on its state (e.g. check if it's
bootstrapping), and in some states switch on the topology
`transition_state` (e.g. check if it's `write_both_read_old`).
We want to introduce transitions that are not node-specific and can work
even when all nodes are 'normal' (so there's no 'node to work on'). As a
first step, we refactor the code so it switches on `transition_state`
first. In some of these states, like `write_both_read_old`, there must
be a 'node to work on' for the state to make sense; but later in some
states it will be optional (such as `commit_cdc_generation`).
The lambdas defined inside the fiber are now methods of this class.
Currently `handle_node_transition` is calling `handle_ring_transition`,
in a later commit we will reverse this: `handle_ring_transition` will
call `handle_node_transition`. We won't have to shuffle the functions
around because they are members of the same class, making the change
easier to review. In general, the code will be easier to maintain in
this new form (no need to deal with so many lambda captures etc.)
Also break up some lines which exceeded the 120 character limit (as per
Seastar coding guidelines).
if the visitor clauses are the same, we can just use the generic version
of it by specifying the parameter with `auto&`. it's simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13626
The new name is more generic - it describes the current step of a
'topology saga` (a sequence of steps used to implement a larger topology
operation such as bootstrap).
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.
The `replication_state` column in `system.topology` is now `static`.
This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy;
they just alter the way in which data ownership is managed. In theory, we
could use them for other strategies as well, like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined; the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`! operator==()`; this matches the behavior of the default
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the defaulted operator!=, C++20 also brings us
the defaulted operator==() -- it is able to generate an
operator==() that performs a member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if an operator==() is implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is implemented using the default
generated one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the default generated `operator<=>` is used in place of
the defaulted `operator==`.
sometimes we fail to mark operator== with the `const`
specifier; in this change, to fulfill the requirements of the C++
standard, and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but this is not always the case: the
`version` class uses `version` as the parameter type. to
fulfill the requirements of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantics of the comparison operator, and it is a more idiomatic
way to pass a non-trivial struct as a function parameter.
please note, because in C++20 both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric forms of another variant. if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13687
Common compression libraries work on contiguous buffers.
Contiguous buffers are a problem for the allocator. However, as long as they are short-lived,
we can avoid the expensive allocations by reusing buffers across tasks.
This idea is already applied to the compression of CQL frames, but with some deficiencies.
`utils: redesign reusable_buffer` attempts to improve upon it in a few ways. See its commit message for an extended discussion.
Compression buffer reuse also happens in the zstd SSTable compressor, but the implementation is misguided. Every `zstd_processor` instance reuses a buffer, but each instance has its own buffer. This is very bad, because a healthy database might have thousands of concurrent instances (because there is one for each sstable reader). Together, the buffers might require gigabytes of memory, and the reuse actually *increases* memory pressure significantly, instead of reducing it.
`zstd: share buffers between compressor instances` aims to improve that by letting a single buffer be shared across all instances on a shard.
Closes #13324
* github.com:scylladb/scylladb:
zstd: share buffers between compressor instances
utils: redesign reusable_buffer
This commit moves the Glossary page to the Reference
section. In addition, it adds the redirection so that
there are no broken links because of this change
and fixes a link to a subsection of Glossary.
Closes #13664
The zstd implementation of `compressor` has a separate decompression and
compression context per instance. This is unreasonably wasteful. One
decompression buffer and one compression buffer *per shard* is enough.
The waste is significant. There might exist thousands of SSTable readers, each
containing its own instance of `compressor` with several hundred KiB worth of
unneeded buffers. This adds up to gigabytes of wasted memory and gigapascals
of allocator pressure.
This patch modifies the implementation of zstd_processor so that all its
instances on the shard share their contexts.
Fixes #11733
Large contiguous buffers put large pressure on the allocator
and are a common source of reactor stalls. Therefore, Scylla avoids
their use, replacing it with fragmented buffers whenever possible.
However, the use of large contiguous buffers is impossible to avoid
when dealing with some external libraries (i.e. some compression
libraries, like LZ4).
Fortunately, calls to external libraries are synchronous, so we can
minimize the allocator impact by reusing a single buffer between calls.
An implementation of such a reusable buffer has two conflicting goals:
to allocate as rarely as possible, and to waste as little memory as
possible. The bigger the buffer, the more likely it is that it will be
able to handle future requests without reallocation, but also the more
memory it ties up.
If request sizes are repetitive, the near-optimal solution is to
simply resize the buffer up to match the biggest seen request,
and never resize down.
However, if we anticipate pathologically large requests, which are
caused by an application/configuration bug and are never repeated
again after they are fixed, we might want to resize down after such
pathological requests stop, so that the memory they took isn't tied
up forever.
The current implementation of reusable buffers handles this by
resizing down to 0 every 100'000 requests.
This patch attempts to solve a few shortcomings of the current
implementation.
1. Resizing to 0 is too aggressive. During regular operation, we will
surely need to resize it back to the previous size again. If something
is allocated in the hole left by the old buffer, this might cause
a stall. We prefer to resize down only after pathological requests.
2. When resizing, the current implementation allocates the new buffer
before freeing the old one. This increases allocator pressure for no
reason.
3. When resizing up, the buffer is resized to exactly the requested
size. That is, if the current size is 1MiB, following requests
of 1MiB+1B and 1MiB+2B will both cause a resize.
It's preferable to limit the set of possible sizes so that every
reset doesn't tend to cause multiple resizes of almost the same size.
The natural set of sizes is powers of 2, because that's what the
underlying buddy allocator uses. No waste is caused by rounding up
the allocation to a power of 2.
4. The interval of 100'000 uses is both too low and too arbitrary.
This is up for discussion, but I think that it's preferable to base
the dynamics of the buffer on time, rather than the number of uses.
It's more predictable to humans.
The implementation proposed in this patch addresses these as follows:
1. Instead of resizing down to 0, we resize to the biggest size
seen in the last period.
As long as at least one maximal (up to a power of 2) "normal" request
appears each period, the buffer will never have to be resized.
2. The capacity of the buffer is always rounded up to the nearest
power of 2.
3. The resize down period is no longer measured in number of requests
but in real time.
Additionally, since a shared buffer in asynchronous code is quite a
footgun, some rudimentary refcounting is added to assert that only
one reference to the buffer exists at a time, and that the buffer isn't
downsized while a reference to it exists.
Fixes #13437
Fixes https://github.com/scylladb/scylladb/issues/13578
Now that the documentation is versioned, we can remove
the .. versionadded:: and .. versionchanged:: information
(especially that the latter is hard to maintain and now
outdated), as well as the outdated information about
experimental features in very old releases.
This commit removes that information and nothing else.
Closes #13680
std::rel_ops was deprecated in C++20, as C++20 provides a better solution for defining comparison operators. and all the use cases previously to be addressed by `using namespace std::rel_ops` have been addressed either by `operator<=>` or the default-generated `operator!=`.
so, in this series, to avoid using deprecated facilities, let's drop all these `using namespace std::rel_ops`. there are many more cases where we could either use `operator<=>` or the default-generated `operator!=` to simplify the implementation. but here, we care more about `std::rel_ops`, we will drop the most (if not all of them) of the explicitly defined `operator!=` and other comparison operators later.
Closes #13676
* github.com:scylladb/scylladb:
treewide: do not use std::rel_ops
dht: token: s/tri_compare/operator<=>/
Currently, the `reset_to()` implementation calls `consume(new_amount)` (if
not zero), then calls `signal(old_amount)`. This means that even if
`reset_to()` is a net reduction in the amount of resources, there is a
call to `consume()` which can now potentially throw.
Add a special case for when the new amount of resources is strictly
smaller than the old amount. In this case, just call `signal()` with the
difference. This not only avoids a potential `std::bad_alloc`, but also
helps relieve memory pressure when it is most needed, by not failing
calls that release memory.
Into reset_to() and reset_to_zero(). The latter replaces `reset()` with
the default 0 resources argument, which was often called from noexcept
contexts. Splitting it out from `reset()` allows for a specialized
implementation that is guaranteed to be `noexcept` indeed, and thus
gives peace of mind.
This API is dangerous, all resource consumption should happen via RAII
objects that guarantee that all consumed resources are appropriately
released.
At this point, said API is just a low-level building block for
higher-level, RAII objects. To ensure nobody thinks of using it for
other purposes, make it private and make external users friends instead.
Fix two issues with the replace operation introduced by recent PRs.
Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).
Fixes: #13651
Closes #13655
* github.com:scylladb/scylladb:
test: new suite for testing raft-based topology
test: remove topology_custom/test_custom.py
raft topology: don't require new CDC generation UUID to always be present
raft topology: include shard_count/ignore_msb during replace
std::rel_ops was deprecated in C++20, as C++20 provides a better
solution for defining comparison operators. and all the use cases
previously to be addressed by `using namespace std::rel_ops` have
been addressed either by `operator<=>` or the default-generated
`operator!=`.
so, in this change, to avoid using deprecated facilities, let's
drop all these `using namespace std::rel_ops`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that C++20 is able to generate the defaulted comparison
operators for us, there is no need to define them manually. and,
`std::rel_ops::*` are deprecated in C++20.
also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `range_tombstone_list` and `range_tombstone_entry`
without the help of `operator<<`.
the corresponding `operator<<()` for `range_tombstone_entry` is moved
into test, where it is used. and the other one is dropped in this change,
as all its callers are now using fmtlib for formatting now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13627
there are two variants of `query_processor::for_each_cql_result()`,
both of them perform the pagination of results returned by a CQL
statement. the one which accepts a function returning an immediate
(non-future) value is not used now. so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13675
Now gossiper doesn't need those two as its dependencies, they can be
removed making code shorter and dependencies graph simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
That's, in fact, an independent change, because feature enabler doesn't
need this method. So this patch is a "while at it" thing, but on the
other hand it ditches one more qctx usage.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers now have the system keyspace instance at hand.
Unfortunately, this de-static doesn't allow more qctx drops, because
both methods use set_|get_scylla_local_param helpers that do use qctx
and are still in use by other static methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code belongs to feature service, system keyspace shouldn't be aware
of any peculiarities of startup features enabling, only loading and
saving the feature lists.
For now the move happens only in terms of code declarations, the
implementation is kept in its old place to reduce the patch churn.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is only called by the enabler and is short enough
to be merged into the caller. This kills two birds with one stone --
makes for fewer walks over the features list and will make it possible to
de-static system keyspace features load and save methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just move the code. This makes gossiper not mess
with enabling/persisting features, but just gossip them around.
Feature intersection code is still in gossiper, but can be moved to a
more suitable location any time later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays features are persisted in feature_service::enable() and there
are four callers of it
- feature enabler via gossiper notifications
- boot kicks feature enabler too
- schema loader tool
- cql test env
None but the first case needs to persist features. The loader tool in
fact doesn't do it even now, by keeping qctx uninitialized. Cql test
env wires up the qctx, but it makes no difference for the test cases
themselves whether the features are persisted or not.
Boot-time is a bit trickier -- it loads the feature list from system
keyspace and may filter out some of them, then enable the rest. In that case
committing the list back into system keyspace makes no sense, as the
persisted list doesn't extend.
The enabler, in turn, can call system keyspace directly via its explicit
dependency reference. This fixes the inverse dependency between system
keyspace and feature service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now knows that it runs inside async context, but things are changing
and soon it will be moved out of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the enabler that's responsible for enabling the features and,
implicitly, persisting them into the system keyspace. This patch moves
this logic from gossiper to feature_enabler, further patching will make
the persisting code be explicit.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a bit hairy. The maybe_enable_features() is called from two places
-- the feature_enabler upon notifications from gossiper and directly by
gossiper from wait_for_gossip_to_settle().
The _latter_ is called only when the wait_for_gossip_to_settle() is
called for the first time because of the _gossip_settled checks in it.
For the first time this method is called by storage_service when it
tries to join the ring (next it's called from main, but that's not of
interest here).
Next, even though feature_enabler is registered early -- when gossiper
instance is constructed by sharded<gossiper>::start() -- it checks that
_gossip_settled is true before taking any action.
Considering both, calling maybe_enable_features() _and_ registering the
enabler after storage_service's call to wait_for_gossip_to_settle()
doesn't break the code logic, but makes further patching possible. In
particular, the feature_enabler will move to feature_service not to
pollute gossiper code with anything that's not gossiping.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And rename the guy. These dependencies will be used further, both are
available and started when the enabler is constructed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.
Fixes #13657
Closes #13658
* github.com:scylladb/scylladb:
cql3: fix printing of column_specification::name in some error messages
cql3: fix printing of column_definition::name in some error messages
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `storage_service::mode` without the help of `operator<<`.
the corresponding `operator<<()` for `storage_service::mode` is removed
in this change, as all its callers are now using fmtlib for
formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13640
Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c992f. It's still useful
to keep the test to avoid regressions.
The bug manifests by reader throwing:
std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
The reason is that prior to the rework of the cache reader,
range_tombstone_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.
Closes#13665
raft_group0 does not really depend on migration_manager, it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
raft_group0 does not really depend on query_processor, it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
Introduce new test suite for testing the new topology coordinator
(runs under `raft` experimental flag). Add a simple test that performs a
basic sequence of topology operations.
raft_group0 does not really depend on storage_service, it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
Consolidate `bytes_view_hasher` and abstract_replication_strategy `factory_key_hasher` which are the same into a reusable utils::basic_xx_hasher.
To be used in a followup series for netw:msg_addr.
Closes#13530
* github.com:scylladb/scylladb:
utils: hashing: use simple_xx_hasher
utils: hashing: add simple_xx_hasher
utils: hashers: add HasherReturning concept
hashing: move static_assert to source file
Add a test for get/enable/disable auto_compaction via the column_family API.
And add log messages for admin operations over that API.
Closes#13566
* github.com:scylladb/scylladb:
api: column_family: add log messages for admin operation
test: rest_api: add test_column_family
After the addition of the rust-std-static-wasm32-wasi target, we're
able to compile the Rust programs to Wasm binaries. However, we're still
only able to handle the Wasm UDFs in the Text format, so we need a tool
to translate the .wasm files to .wat. Additionally, the .wasm files
generated by default are unnecessarily large, which can be helped
using wasm-opt and wasm-strip.
The tool for translating wasm to wat (wasm2wat), and the tool for
stripping the wasm binaries (wasm-strip) are included in the `wabt`
package, and the optimization tool (wasm-opt) is included in the
`binaryen` package. Both packages are added to install-dependencies.sh
Closes#13282
[avi: regenerate frozen toolchain]
Closes#13605
The s3::readable_file::stat() call returns a hand-crafted stat structure
with some fields set to sane values, most of them constants. However,
other fields remain uninitialized, which sometimes leads to trouble.
Better to fill the stat with zeroes and later revisit it for more sane
values.
fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13650
the temporary directory holding the log file collecting the scylla
subprocess's output is specified by the test itself, and it is
`test_tempdir`. but unfortunately, cql-pytest/run.py is not aware
of this. so `cleanup_all()` is not able to print out the logging
messages at exit. note that cql-pytest/run.py always
collects the "log" file under the directory created using `pid_to_dir()`,
where pid is that of the spawned subprocess. but `object_store/run` uses
the main process's pid for its reusable tempdir.
so, with this change, we also register a cleanup func to printout
the logging message when the test exits.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For concurrent schema changes test, log when the different stages of the
test are finished.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13654
this change ensures that `dk._key` is formatted with the "pk" prefix.
as in 3738fcb, the `operator<<` for partition_key was removed. so the
compiler has to find an alternative when this operator<< is
called. fortunately, from the compiler's
perspective, `partition_key` has an `operator managed_bytes_view`, and
this operator does not have the explicit specifier, and,
`managed_bytes_view` does support `operator<<`. so this ends up with a
change in the format of `decorated_key` when it is printed using
`operator<<`. the code compiles. but unfortunately, the behavior is
changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the
partition_key is supposed to be printed like "pk{010203}" instead of
"010203". the latter is how `managed_bytes_view` is formatted.
a test is added accordingly to avoid future changes which break the
dtest.
Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13653
column_specification::name is a shared pointer, so it should be
dereferenced before printing - because we want to print the name, not
the pointer.
Fix a few instances of this mistake in prepare_expr.cc. Other instances
were already correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Printing a column_definition::name() in an error message is wrong,
because it is "bytes" and printed as hexadecimal ASCII codes :-(
Some error messages in cql3/operation.cc incorrectly used name()
and should be changed to name_as_text(), as was correctly done in
a few other error messages in the same file.
This patch also fixes a few places in the test/cql approval tests which
"enshrined" the wrong behavior - printing things like 666c697374696e74
in error messages - and now need to be updated for the right behavior.
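A standalone illustration of the difference (the hex helper and accessor names here are invented for the sketch, not Scylla's): printing the raw bytes of a name like "flistint" yields its hex encoding, which is exactly the kind of output the approval tests had enshrined:

```cpp
#include <cstdint>
#include <string>
#include <vector>

using bytes = std::vector<uint8_t>;

// what printing a raw "bytes" name amounts to: hex ASCII codes
std::string to_hex(const bytes& b) {
    static const char* digits = "0123456789abcdef";
    std::string out;
    for (auto c : b) {
        out += digits[c >> 4];
        out += digits[c & 0xf];
    }
    return out;
}

// what error messages should print instead: the readable text form
std::string name_as_text(const bytes& b) {
    return std::string(b.begin(), b.end());  // assumes an ASCII/UTF-8 name
}
```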
Fixes#13657
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
During node replace we don't introduce a new CDC generation, only during
regular bootstrap. Instead of checking that `new_cdc_generation_uuid`
must be present whenever there's a topology transition, only check it
when we're in `commit_cdc_generation` state.
Use simple_xx_hasher for bytes_view and effective_replication_map::factory_key
appending hashers instead of their custom, yet identical implementations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And a more specific HasherReturningBytes for hashers
that return bytes in finalize().
HasherReturning will be used by the following patch
also for simple hashers that return size_t from
finalize().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
when reading this source code, there are a handful of issues reported by my flycheck plugin. none of them is critical, but we are better off fixing them.
Closes#13612
* github.com:scylladb/scylladb:
test: object_store: specify timeout
test: object_store: s/exit/sys.exit/
test: object_store: do not declare a global variable for read
test: object_store: remove unused imports
When moving to the base directory, the printout currently looks broken:
```
INFO 2023-04-16 09:15:58,631 [shard 0] sstable - Moving sstable .../data/ks/cf-4c1bb670dc3711ed96733daf102e4aab/upload/md-1-big-Data.db to in ".../data/ks/cf-4c1bb670dc3711ed96733daf102e4aab/"
```
Since `path` already contains `to`, the message can be just simplified
and `to` need not be printed explicitly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13525
this change fixes the regression introduced by
ebf5e138e8, which
* initialized `truncate_timeout_in_ms` with
`counter_write_request_timeout_in_ms`,
* returns `cas_timeout_in_ms` in the place of
`other_timeout_in_ms`.
in this change, these two misconfigurations are fixed.
Fixes#13633
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13639
in 5a9b4c02e3, the iostream based
formatter was dropped; there is no need to include `<iostream>`
or `<iosfwd>` in these source files anymore.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13643
Currently, responsible for injecting mutations of system.tablets into
schema changes.
Note that not all migrations are handled yet. Dependent view or
cdc table drops are not handled.
tablet_aware_replication_strategy is a trait class meant to be
inherited by replication strategies which want to work with tablets. The
trait produces a per-table effective_replication_map which looks at
tablet metadata to determine replicas.
No replication strategy is changed to use tablets yet in this patch.
This change puts the reloading into topology_state_load(), which is a
function which reloads token_metadata from system.topology (the new
raft-based topology management). It clears the metadata, so needs to
reload tablet map too. In the future, tablet metadata could change as
part of topology transaction too, so we reload rather than preserve.
It is already set by schema_maker. In tablets_test we will depend on
the id being the same as that set in the schema_builder, so don't
change it to something else.
Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.
This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
This reverts commit 95bf8eebe0.
Later patches will adapt this code to work with token_range_splitter,
and the unit test added by the reverted commit will start to fail.
The unit test asks the query_ranges_to_vnodes_generator to split the range:
[t:end, t+1:start)
around token t, and expects the generator to produce an empty range
[t:end, t:end]
After adapting this code to token_range_splitter, the input range will
not be split because it is recognized as adjacent to t:end, and the
optimization logic will not kick in. Rather than adding more logic to
handle this case, I think it's better to drop the optimization, as it
is not very useful (rarely happens) and not required for correctness.
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. work (not fail) in the presence of keyspaces with per-table
erms, they will just not be replicated using those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but works only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
With tablet-based replication strategies it will represent replication
of a single table.
Current vnode_effective_replication_map can be adapted to this interface.
This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
`os.P_NOWAIT` is supposed to be used in spawn calls, while `os.WNOHANG`
is used as the options parameter passed to wait calls. fortunately,
`P_NOWAIT` is defined as "1" in CPython, and `os.WNOHANG` is defined
as "1" in linux kernel. that's why the existing implementation works.
but we should not rely on this coincidence. so, in this change,
`os.P_NOWAIT` is replaced with `os.WNOHANG` for correctness and for
better readability.
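A small standalone demonstration (the `try_reap` helper is illustrative, not the test's actual code):

```python
import os

# on Linux the two constants coincidentally share the value 1, which is
# why the old code appeared to work despite using the wrong constant
print(os.P_NOWAIT, os.WNOHANG)  # 1 1 on Linux

def try_reap(pid):
    # os.WNOHANG is the correct flag here: waitpid() returns (0, 0)
    # instead of blocking when the child has not exited yet
    wpid, _status = os.waitpid(pid, os.WNOHANG)
    return wpid != 0
```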
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13646
before this change we did not check whether the clustering_key to be
formatted is UTF-8 encoded before printing it. but we do perform the
validation when printing partition_keys. the clustering_key is no
different from the partition_key when it comes to encoding; they are
just different parts of a primary key. so let's validate the encoding
of the clustering_key as well when formatting it. this change is a
follow-up of 85b21ba049.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13641
This series adds handling of null std::unique_ptr to utils::clear_gently
and handling of std::optional and seastar::optimized_optional (both engaged and disengaged cases).
Also, unit tests were added to test the above cases.
Fixes #13636
Closes #13638
* github.com:scylladb/scylladb:
utils: clear_gently: add variants for optional values
utils: clear_gently: do not clear null unique_ptr
now that C++20 generates operator== for us, there is no need to
handcraft it manually. also, since C++17, the standard library offers a
default implementation of operator== for `std::variant<>`, so no need
to implement it ourselves.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13625
Add a formatter to compaction::table_state that
prints the table ks_name.cf_name and compaction group id.
Fixes#13467
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All users of global proxy are gone (*), proxy can be made fully main/cql_test_env local.
(*) one test case still needs it, but can get it via cql_test_env
Closes#13616
* github.com:scylladb/scylladb:
code: Remove global proxy
schema_change_test: Use proxy from cql_test_env
test: Carry proxy reference on cql_test_env
this is the first step to the uuid-based generation identifier. the goal is to encapsulate the generation related logic in generator, so its consumers do not have to understand the difference between the int64_t based generation and UUID v1 based generation.
this commit should not change the behavior of existing scylla. it just allows us to derive from `generation_generator` so we can have another generator which generates UUID based generation identifier.
Closes#13073
* github.com:scylladb/scylladb:
replica, test: create generation id using generator
sstables: add generation_generator
test: sstables: use generate_n for generating ids for testing
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level messages and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message at debug
level instead.
Fixes#13466
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`seastar::current_backtrace()` can be quite heavy.
When we pass it to a log message at a relatively detailed log level
(debug/trace), we pay the price of `current_backtrace` every time,
even though we rarely print the message.
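The pattern can be modeled with a stand-in logger (the real code uses seastar's logger; all names below are simplified for the sketch):

```cpp
#include <string>

enum class log_level { error, info, debug, trace };

int backtrace_calls = 0;

// stand-in for seastar::current_backtrace(); the real one walks the
// stack and is expensive
std::string current_backtrace() {
    ++backtrace_calls;
    return "<frames>";
}

struct logger {
    log_level threshold = log_level::info;
    bool is_enabled(log_level l) const { return l <= threshold; }
    void debug(const std::string&) {}
};

void log_with_backtrace(logger& lg) {
    // cheap level check first, so the expensive call only happens
    // when the message will actually be emitted
    if (lg.is_enabled(log_level::debug)) {
        lg.debug("topology changed, from: " + current_backtrace());
    }
}
```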
Closes#13527
* github.com:scylladb/scylladb:
locator/topology: call seastar::current_backtrace only when log_level is enabled
schema_tables: call seastar::current_backtrace only when log_level is enabled
This series cleans up the generation and value types used in gms / gossiper.
Currently we use a blend of int, int32_t, and int64_t around messaging.
This change defines gms::generation_type and gms::version_type as int32_t
and adds a check in non-release modes that the respective int64 values passed over messaging do not overflow 32 bits.
Closes#12966
* github.com:scylladb/scylladb:
gossiper: version_generator: add {debug_,}validate_gossip_generation
gms: gossip_digest: use generation_type and version_type
gms: heart_beat_state: use generation_type and version_type
gms: versioned_value: use version_type
gms: version_generator: define version_type and generation_type strong types
utils: move generation-number to gms
utils: add tagged_integer
gms: versioned_value: make members private
scylla-gdb: add get_gms_versioned_value
gms: versioned_value: delete unused compare_to function
gms: gossip_digest: delete unused compare_to function
Implement clear_gently for std::optional<T>
and seastar::optimized_optional<T> and respective
unit tests.
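A simplified synchronous model of the behavior (the real utils::clear_gently returns a future<>; this sketch only shows the engaged/disengaged handling, including the null unique_ptr case fixed below):

```cpp
#include <memory>
#include <optional>

// leaf overload: nothing to recurse into
template <typename T>
void clear_gently(T&) {}

// engaged optional: recurse into the value, then reset; disengaged: no-op
template <typename T>
void clear_gently(std::optional<T>& opt) {
    if (opt) {
        clear_gently(*opt);
        opt.reset();
    }
}

// unique_ptr: skip null pointers instead of dereferencing them
template <typename T>
void clear_gently(std::unique_ptr<T>& ptr) {
    if (ptr) {
        clear_gently(*ptr);
        ptr.reset();
    }
}
```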
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Otherwise the null pointer is dereferenced.
Add a unit test reproducing the issue
and testing this fix.
Fixes#13636
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The only reason why it's there (right next to compaction_fwd.hh) is
because the database::table_truncate_state subclass needs the definition
of compaction_manager::compaction_reenabler subclass.
However, the former subclass is not used outside of database.cc and can be
defined in .cc. Keeping it outside of the header allows dropping the
compaction_manager.hh from database.hh thus greatly reducing its fanout
over the code (from ~180 indirect inclusions down to ~20).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13622
Now, with f1bbf705f9
(Cleanup sstables in resharding and other compaction types),
we may filter sstables as part of resharding compaction
and the assertion that all tokens are owned by the current
shard when filtering is no longer true.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When distributing the resharding jobs, prefer one of
the sstable shard owners based on foreign_sstable_open_info.
This is particularly important for uploaded sstables
that are resharded since they require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Seen after f1bbf705f9 in debug mode
distributed_loader collect_all_shared_sstables copies
compaction::owned_ranges_ptr (lw_shared_ptr<const
dht::token_range_vector>)
across shards.
Since update_sstable_cleanup_state is synchronous, it can
be passed a const reference to the token_range_vector instead.
It is ok to access the memory read-only across shards
and since this happens on start-up, there are no special
performance requirements.
Fixes#13631
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure that the int64_t generation we get over rpc
fits in the int32_t generation_type we keep locally.
Restrict this assertion to non-release builds.
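A sketch of the narrowing check (helper names invented for illustration; the assert compiles out in release/NDEBUG builds, matching the "non-release" restriction):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

bool fits_in_int32(int64_t v) {
    return v >= std::numeric_limits<int32_t>::min() &&
           v <= std::numeric_limits<int32_t>::max();
}

int32_t to_generation(int64_t wire_value) {
    // compiled out in release (NDEBUG) builds
    assert(fits_in_int32(wire_value));
    return static_cast<int32_t>(wire_value);
}
```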
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adjust scylla-gdb.get_gms_version_value
to get the versioned_value version as version_type
(utils::tagged_integer).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Derived from utils::tagged_integer, using different tags,
the types are incompatible with each other and require explicit
typecasting to- and from- their value type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although the get_generation_number implementation is
completely generic, it is used exclusively to seed
the gossip generation number.
Following patches will define a strong gms::generation_id
type and this function should return it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A generic template for defining strongly typed
integer types.
Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and provide accessor functions to get them.
1. So they can't be modified by mistake, as the versioned value is
immutable. A new value must have a higher version.
2. Before making the version a strong gms::version_type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for next patch that makes gms::versioned_value
members private, and provides methods by the same name
as the current members.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Introduce a new table `CDC_GENERATIONS_V3` (`system.cdc_generations_v3`).
The table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The
difference is that V2 lives in `system_distributed_keyspace` and writes to it
are distributed using regular `storage_proxy` replication mechanisms based on
the token ring. The V3 table lives in `system_keyspace` and any mutations
written to it will go through group 0.
Extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`, it stores UUID of a newly introduced CDC
generation which is used as partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
When topology coordinator handles a request for node to join, calculate a new
CDC generation using the bootstrapping node's tokens, translate it to mutation
format, and insert this mutation to the CDC_GENERATIONS_V3 table through group 0
at the same time we assign tokens to the node in Raft topology. The partition
key for this data is stored in the bootstrapping node's `ring_slice`.
After inserting the new CDC generation data, we need to pick a timestamp for this
generation and commit it, telling all nodes in the cluster to start using the
generation for CDC log writes once their clocks cross that timestamp.
We introduce a separate step to the bootstrap saga, before
`write_both_read_old`, called `commit_cdc_generation`. In this step, the
coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping
node's `ring_slice` - which serves as the key to the table where the CDC
generation data is stored - and combines it with a timestamp which it generates
a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by
default 1 minute). This gives us a CDC generation ID which we commit into the
topology state as the `current_cdc_generation_id` while switching the saga to
the next step, `write_both_read_old`.
Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description tables so
CDC applications know which streams to read from.
This uses regular distributed table writes underneath (tables living in the
`system_distributed` keyspace) so it requires `token_metadata` to be nonempty.
We need a hack for the case of bootstrapping the first node in the cluster -
turning the tokens into normal tokens earlier in the procedure in
`token_metadata`, but this is fine for the single-node case since no streaming
is happening.
When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data structures
that are used when coordinating writes to CDC log tables.
We include the current CDC generation data in topology snapshot transfers.
Some fixes and refactors included.
Closes#13385
* github.com:scylladb/scylladb:
docs: cdc: describe generation changes using group 0 topology coordinator
cdc: generation_service: add a FIXME
cdc: generation_service: add legacy_ prefix for gossiper-based functions
storage_service: include current CDC generation data in topology snapshots
db: system_keyspace: introduce `query_mutations` with range/slice
storage_service: hold group 0 apply mutex when reading topology snapshot
service: raft_group0_client: introduce `hold_read_apply_mutex`
storage_service: use CDC generations introduced by Raft topology
raft topology: publish new CDC generation to the user description tables
raft topology: commit a new CDC generation on node bootstrap
raft topology: create new CDC generation data during node bootstrap
service: topology_state_machine: make topology::find const
db: system_keyspace: small refactor of `load_topology_state`
cdc: generation: extract pure parts of `make_new_generation` outside
db: system_keyspace: add storage for CDC generations managed by group 0
service: topology_state_machine: better error checking for state name (de)serialization
service: raft: plumbing `cdc::generation_service&`
cdc: generation: `get_cdc_generation_mutations`: take timestamp as parameter
cdc: generation: make `topology_description_generator::get_sharding_info` a parameter
sys_dist_ks: make `get_cdc_generation_mutations` public
sys_dist_ks: move find_schema outside `get_cdc_generation_mutations`
sys_dist_ks: move mutation size threshold calculation outside `get_cdc_generation_mutations`
service/raft: group0_state_machine: signal topology state machine in `load_snapshot`
we only need to declare a variable with `global` when we need to
write to it; if we just want to read it, there is no need for the
declaration. because of the way python looks up a variable when
reading it, the global variables (and, for that matter, the
functions!) are found anyway. but when we assign to a variable in
python, the interpreter has to decide in which scope the variable
lives. by default the local scope is used, and a new variable is
added to `locals()`.
but in this case, we just read from it. so no need to add the
`global` statement.
see also https://docs.python.org/3/reference/simple_stmts.html#global
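For example:

```python
counter = 0

def read_only():
    # reading a global needs no `global` statement: name lookup falls
    # through local -> enclosing -> global -> builtins
    return counter + 1

def increments():
    global counter  # required only because we assign to the name
    counter += 1
```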
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In order not to copy the rvalue consumer arg -- instantly convert it
into a value. No other tricks.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
reuse generation_generator for generating generation identifiers to
reduce repetition. also, allow the generator to update its
latest known generation id.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to prepare for the uuid-based generation identifier: we will
generate a uuid-based generation identifier if the corresponding
option is enabled, otherwise an integer-based id. to reduce
repetition, generation_generator is extracted out so it can be reused.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The tombstone_gc was documented as experimental in version 5.0.
It is no longer experimental in version 5.2.
This commit updates the information about the option.
Closes #13469
the standard library offers
`std::lexicographical_compare_three_way()`, and we never use the
last two additional parameters which it does not provide, so there is
no need to keep the homebrew version of the trichotomic compare function.
in this change,
* all occurrences of `lexicographical_tri_compare()` are replaced
with `std::lexicographical_compare_three_way()`.
* `lexicographical_tri_compare()` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13615
A C++20 compiler is able to generate defaulted operator== and
operator!=, and the default generated operators behave exactly
the same as the ones crafted by us. so let's let it do its job.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13614
No code needs the global proxy anymore. Keep on-stack values in main and
cql_test_env and keep the pointer in the debug:: namespace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's one place where a test case calls for storage proxy and currently
does it via a global reference. Time to switch it to cql_test_env's one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All sharded<> services are created by cql_test_env on the stack. The
cql_test_env() is then used to keep references on some of them and to
export them to test cases via its methods. Proxy is missing on that
exportable list, but will be needed, so add one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
Fixes #13594
Closes #13604
* github.com:scylladb/scylladb:
db: system_keyspace: use microsecond resolution for group0_history range tombstone
utils: UUID_gen: accept decimicroseconds in min_time_UUID
Move gms::arrival_window to api/failure_detector which is its only user.
and get rid of the rest, which is not used now that we use direct_failure_detector instead.
TODO: integrate direct_failure_detector with failure_detector api.
Closes #13576
* github.com:scylladb/scylladb:
gms: get rid of unused failure_detector
api: failure_detector: remove false dependency on failure_detector::arrival_window
test: rest_api: add test_failure_detector
Fixes a problem when raft-based topology is enabled, which loads
topology from storage. It starts by clearing topology and then adding
nodes one by one. Before this patch, this violates internal invariant
of topology object which puts the local node as the first node. This
would manifest by triggering an assert in topology::pop_node() which
throws if popping the node at index 0 in order to keep the information
about local node around. This is normally prevented by a check in
topology::remove_node() which avoids calling pop_node() if removing the
local node. But since there is no node which is marked as local, this
check allows the first node to be popped.
To fix the problem I lift the invariant that local node is always in
_nodes. We still have information about local node in config. Instead
of keeping it in _nodes, we recognize it as part of indexing. We also
allow removing the local node like a regular node.
The path which reloads topology works correctly after this, the local
node will be recognized when (if) it is added to the topology.
Fixes #13495
Closes #13498
* github.com:scylladb/scylladb:
locator: topology: Fix move assignment
locator: topology: Add printer
tests: topology: Test that topology clearing preserves information about local node
locator: topology: Recognize local node as part of indexing it
locator: topology: Fix get_location(ep) for local node
locator: topology: Fix typo
locator: topology: Preserve config when cloning
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
Fixes #13594
so we don't need to keep a `prev_gen` around, this also prepares for
the coming change to use generation generator.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `function_name` without the help of `operator<<`.
the corresponding `operator<<()` is dropped in this change,
as all its callers are now using fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13608
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `dht::token` without the help of `operator<<`.
the corresponding `operator<<()` is preserved in this change, as it has lots of users in this project, we will tackle them case-by-case in follow-up changes.
also, the forward declaration of `operator<<(ostream&, const dht::token&)` in `dht/i_partitioner.hh` is removed, as it is not necessary.
Refs https://github.com/scylladb/scylladb/issues/13245
Closes #13610
* github.com:scylladb/scylladb:
dht: remove unnecessarily forward declaration
dht: specialize fmt::formatter<dht::token>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `component_type` without the help of `operator<<`.
the corresponding `operator<<()` is dropped in this change,
as all its callers are now using fmtlib for formatting.
also, please note, to enable fmtlib to format `std::set<component_type>`
in `test/boost/sstable_3_x_test.cc` , we need to include
`<fmt/ranges.h>` in that source file.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13598
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `stream_reason` without the help of `operator<<`.
please note, since we still cannot use the generic formatter for
std::unordered_map provided by fmtlib, in order to drop `operator<<`
for `stream_reason` and to print `unordered_map<stream_reason>`,
`fmt::join()` is used as a temporary solution. we will audit all
`fmt::join()` calls after removing the homebrew formatter of
`std::unordered_map`.
the corresponding `operator<<()` is dropped in this change,
as all its callers are now using fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13609
this change replaces all occurrences of `boost::lexical_cast<std::string>`
in the source tree with `fmt::to_string()`, for a couple of reasons:
* `boost::lexical_cast<std::string>` is longer than `fmt::to_string()`,
so the latter is easier to parse and read.
* `boost::lexical_cast<std::string>` creates a stringstream under the
hood, so it can use the `operator<<` to stringify the given object.
but stringstream is known to be less performant than fmtlib.
* we are migrating to fmtlib based formatting, see #13245. so
using `fmt::to_string()` helps us to remove yet another dependency
on `operator<<`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13611
The legacy failure_detector is now unused and can be removed.
TODO: integrate direct_failure_detector with failure_detector api.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Up until 0ef33b71ba
get_endpoint_phi_values retrieved arrival samples
from gms::get_arrival_samples(). That function was
removed since it returned a constant empty map.
This patch returns empty results without relying
on failure_detector::arrival_window, so the latter can
be retired altogether.
As Tomasz Grabiec <tgrabiec@scylladb.com> said:
> I don't think the logic of arrival_window belongs to api,
> it belongs to the failure detector. If there is no longer
> a failure detector, there should be no arrival_window.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch contains tests reproducing issue #13601 and the corresponding
Cassandra issue CASSANDRA-18470. These issues are about what the AVG
aggregation does for arbitrary-precision "decimal" numbers - the tests
we add here show examples where the current behavior doesn't make sense:
The problem is that "decimal" has arbitrary precision - so, should an
average of 1/3 be returned as 0.3 or 0.33333333333333333? This is not
specified, so Scylla (and Cassandra) decided to pick the result precision
based on the input precision. In particular, the average of 1 and 2
is returned as 2 (zero digits after the decimal point, like in the
inputs) instead of the expected 1.5. Arguably this isn't useful behavior.
The patch adds a second test which fails on Cassandra, but does pass
on Scylla: Cassandra returns the integer 1 as the average of 1, 2, 2, 3,
whereas the correct average is 2 (and Scylla returns it correctly).
The reason why this bug is even worse on Cassandra is that Scylla's AVG
only loses precision when dividing the sum and count, but Cassandra
tries to maintain only the average, and loses precision at every step.
Refs #13601
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13603
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `apply_resume` without the help of `operator<<`.
the corresponding `operator<<()` is dropped in this change,
as all its callers are now using fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13584
Currently, the reader might dereference a null pointer
if the input stream reaches eof prematurely,
and read_exactly returns an empty temporary_buffer.
Detect this condition before dereferencing the buffer
and throw sstables::malformed_sstable_exception.
Fixes #13599
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13600
Database functions currently receive their arguments as an std::vector. This
is inflexible (for example, one cannot use small_vector to reduce allocations).
This series adapts the function signature to accept parameters using std::span.
Some changes in the keys interface are needed to support this. Lastly, one call
site is migrated to small_vector.
This is in support of changing selectors to use expressions.
Closes #13581
* github.com:scylladb/scylladb:
cql3: abstract_function_selector: use small_vector for argument buffer
db, cql3: functions: pass function parameters as a span instead of a vector
keys: change from_optional_exploded to accept a span instead of a vector
it turns out the declaration of `operator<<(ostream&, const
dht::token&)` is unnecessary. so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `dht::token` without the help of `operator<<`.
the corresponding `operator<<()` is preserved in this change, as it
has lots of users in this project, we will tackle them case-by-case in
follow-up changes.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fixes a problem when raft-based topology is enabled, which loads
topology from storage. It starts by clearing topology and then adding
nodes one by one. Before this patch, this violates internal invariant
of topology object which puts the local node as the first node. This
would manifest by triggering an assert in topology::pop_node() which
throws if popping the node at index 0 in order to keep the information
about local node around. This is normally prevented by a check in
topology::remove_node() which avoids calling pop_node() if removing the
local node. But since there is no node which is marked as local, this
check allows the first node to be popped.
To fix the problem I lift the invariant that local node is always in
_nodes. We still have information about local node in config. Instead
of keeping it in _nodes, we recognize it as part of indexing. We also
allow removing the local node like a regular node.
The path which reloads topology works correctly after this, the local
node will be recognized when (if) it is added to the topology.
Fixes #13495
topology config may designate a different node than
get_broadcast_address() as local node. In particular, some tests don't
designate any node as the local node, which leads to logic errors
where current get_location(ep) for ep which happens to have the
address 127.0.0.1 returns location of the first node in _nodes rather
than ep.
Fix by looking up in _nodes first and falling back to the local node if
it's equal to the configured local node (if any).
Config is separate from state of the topology (nodes it
contains). Preserving the config will make it easier in later patches
to maintain invariants for cloned instances.
The tests in question are using the MINIO_SERVER_ADDRESS environment variable to export the minio server address from pylib to test cases. Also they use a hard-coded public bucket name. Both play badly with AWS S3, the former due to MINIO_... in its name and the latter because the public bucket name can be anything.
So this PR puts the address and public bucket name into S3_..._FOR_TEST environment variables and fixes output stream closure on failure while at it.
Detached from #13493
Closes #13546
* github.com:scylladb/scylladb:
s3/test: Rename MINIO_SERVER_ADDRESS environment variable
s3/test: Keep public bucket name in environment
s3/test: Fix upload stream closure
test/lib: Add getenv_safe() helper
Update the `Generation switching` section: most of the existing
description landed in `Gossiper-based topology changes` subsection, and
a new subsection was added to describe Raft group 0 based topology
changes. Marked as WIP - we expect further development in this area
soon.
The existing gossiper-based description was also updated a bit.
Note that we don't need to include earlier CDC generations, just the
current (i.e. latest) one.
We might observe a problem when nodes are being bootstrapped in quick
succession - I left a FIXME describing the problem and possible
solutions.
There is a `query_mutations` function which loads the entire contents of
a given table into memory. There was no function for e.g. loading just a
single partition in the form of mutations. Introduce one.
This is a bugfix: we need to hold the mutex when loading topology data
from tables, otherwise they might be concurrently modified by
`group0_state_machine::apply` and the snapshot that we send won't make
any sense.
Also specify in comments that the lock must be held during
`topology_transition`, `topology_state_load`, `merge_topology_snapshot`.
When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data
structures that are used when coordinating writes to CDC log tables.
Once a new CDC generation is committed to the cluster by the topology
coordinator, we also need to publish it to the user-facing description
tables so CDC applications know which streams to read from.
This uses regular distributed table writes underneath (tables living
in the `system_distributed` keyspace) so it requires `token_metadata`
to be nonempty. We need a hack for the case of bootstrapping the
first node in the cluster - turning the tokens into normal tokens
earlier in the procedure in `token_metadata`, but this is fine for the
single-node case since no streaming is happening.
After inserting new CDC generation data (see previous commit), we need
to pick a timestamp for this generation and commit it, telling all nodes
in the cluster to start using the generation for CDC log writes once
their clocks cross that timestamp.
We introduce a separate step to the bootstrap saga, before
`write_both_read_old`, called `commit_cdc_generation`. In this step, the
coordinator takes the `new_cdc_generation_data_uuid` stored in a
bootstrapping node's `ring_slice` - which serves as the key to the table
where the CDC generation data is stored - and combines it with a
timestamp which it generates a bit into the future (as in old
gossiper-based code, we use 2 * ring_delay, by default 1 minute). This
gives us a CDC generation ID which we commit into the topology state as
the `current_cdc_generation_id` while switching the saga to the next
step, `write_both_read_old`.
`system_keyspace::load_topology_state` is extended to load
`current_cdc_generation_id`.
For now, nodes don't react to `current_cdc_generation_id`. In later
commit we'll extend `storage_service::topology_state_load` to start
using the current CDC generation for CDC log table writes.
The solution with specifying a timestamp into the future is the same as
it is for gossip-based topology changes and it has the same consistency
problem - if some node is temporarily partitioned away from the quorum,
it might not learn about the new CDC generation before its clock crosses
the generation's timestamp, causing it to temporarily send writes to the
wrong CDC streams (until it learns about the new timestamp). I left a
FIXME which describes an alternative solution which wasn't viable for
gossiper-based topology changes, but it is viable when we have a
fault-tolerant topology coordinator.
Calculate a new CDC generation using the bootstrapping node's tokens,
translate it to mutation format, and insert this mutation to the
CDC_GENERATIONS_V3 table through group 0 at the same time we assign
tokens to the node in Raft topology. The partition key for this data is
stored in the bootstrapping node's `ring_slice`.
The data is inserted, but it's not used for anything yet, we'll do it in
later commits.
Two FIXMEs are left for follow-ups:
- in `get_sharding_info` we shouldn't have to use the token owner's IP,
but get the host ID directly from token metadata (#12279),
- splitting the CDC generation data write into multiple commands. The
comment elaborates.
The upload_sink::_upload_id remains empty until upload starts, remains
non-empty while it proceeds, then becomes empty again after it
completes. The upload_started() method checks that, and on .close() a
started upload is aborted.
The final switch to empty is done by std::move()ing the upload id into
the completion request, but it's better to use std::exchange() to emphasize
the fact that the _upload_id becomes empty at that point for a reason.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13570
The variables necessary for constructing a `ring_slice` are now living in
a local block of code. This makes it easier to see which data is part of
the `ring_slice` and will make it easier to add more data to
`ring_slice` in following commits.
Also add some more sanity checking.
The method needs proxy to get data_dictionary::database from to pass down to select_statement::prepare(). And a legacy bit that can come with data_dictionary::database as well. Fortunately, all the call traces that end up at select_statement() start inside table:: methods that have view_update_generator, or at view_builder::consumer that has reference to view_builder. Both services can share the database reference. However, the call traces in question pass through several code layers, so the PR adds data_dictionary::database to those layers one by one.
Closes #13591
* github.com:scylladb/scylladb:
view_info: Drop calls to get_local_storage_proxy()
view_info: Add data_dictionary argument to select_statement()
view_info: Add data_dictionary argument to partition_slice() method
view_filter_checking_visitor: Construct with data_dictionary
view: Carry data_dictionary arg through standalone helpers
view_updates: Carry data_dictionary argument throug methods
view_update_builder: Construct with data dictionary
table: Push view_update_generator arg to affected_views()
view: Add database getters to v._update_generator and v._builder
`cdc::generation_service::make_new_cdc_generation` would create a new
CDC generation and insert it into the `CDC_GENERATIONS_V2` table these
days. For Raft-based topology changes we'll do the data insertion
somewhere else - in topology coordinator code. So extract the parts for
calculating the CDC generation to free-standing functions (these are
almost pure calculations, modulo accessing RNG).
The `CDC_GENERATIONS_V3` table schema is a copy-paste of the
`CDC_GENERATIONS_V2` schema. The difference is that V2 lives in
`system_distributed_keyspace` and writes to it are distributed using
regular `storage_proxy` replication mechanisms based on the token ring.
The V3 table lives in `system_keyspace` and any mutations written to it
will go through group 0.
Also extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`, it stores UUID of a newly introduced CDC
generation which is used as partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
For example:
```
std::ostream& operator<<(std::ostream& os, ring_slice::replication_state s) {
os << replication_state_to_name_map[s];
return os;
}
```
this would print an empty string if the state was missing from
`replication_state_to_name_map` (because `operator[]` default-constructs
a value if it's missing).
Use `find` instead and make it an error if the state is missing.
Also turn `throw std::runtime_error` into `on_internal_error` in
deserialization functions because failure to deserialize a state name is
an internal error, not user error.
The function would generate a mutation timestamp for itself; take it as
a parameter instead. We'll use timestamps provided by Group 0 APIs when
creating CDC generations during Group 0-based topology changes.
The function used to obtain the sharding info for a given node (its
number of shards and ignore_msb_bits) was using gossiper application
states.
We want to reuse `topology_description_generator` to build CDC
generations when doing Raft Group 0-based topology changes, so make
`get_sharding_info` a parameter.
It was a `static` function inside system_distributed_keyspace. Later it
will be used for another table living in system_keyspace, so move it
outside, to the CDC generations module, and make it accessible from
other places.
The function turns a `cdc::topology_description` into a vector of
mutations. It decides when to push_back a new mutation (instead of
extending an existing one) based on certain parameters. This calculation
is specific to where we insert the mutation later.
Move the calculation outside, to the function which does the insertion.
`get_cdc_generation_mutations` will be used outside this function later.
There are a few places in the API handlers that call the global proxy for their needs. Most of those places are easy to patch, because proxy is available via the http_ctx thing right inside the handler code. Also there's handler code in view_builder that needs proxy too, but it really needs topology, not proxy, and can get it elsewhere (the handler is coroutinized while at it).
Closes #13593
* github.com:scylladb/scylladb:
view: Get topology via database tokens
view: Indentation fix after previous patch
view: Coroutinuze view_builder::view_build_statuses()
api: Use ctx.sp in storage service handler
api,main: Unset storage_proxy API on stop
api: Use ctx.sp in set_storage_proxy() routes
This `with` context is supposed to disable, then re-enable
autocompaction for the given keyspaces, but it used the wrong API for
it, it used the column_family/autocompaction API, which operates on
column families, not keyspaces. This oversight led to a silent failure
because the code didn't check the result of the request.
Both are fixed in this patch:
* switch to use `storage_service/auto_compaction/{keyspace}` endpoint
* check the result of the API calls and report errors as exceptions
Fixes: #13553
Closes #13568
server to see other servers after start/restart
When starting/restarting a server, provide a way to wait for the server
to see at least n other servers.
Also leave the implementation methods available for manual use and
update previous tests, one to wait for a specific server to be seen, and
one to wait for a specific server to not be seen (down).
Fixes #13147
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13438
The view_builder::view_build_statuses() needs topology to walk its
nodes. Now it gets one from the global proxy via its token metadata, but
database also has tokens, and view_builder has a reference to database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to the previous patch, but for another routes group. The storage
service API calls mainly use storage service, but one place needs proxy
to call recalculate_schema_version() with it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that the routes referencing and using ctx.sp don't step on a proxy
that's going to be removed (not now, but some time later) from under
them on shutdown.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The logger instance was removed in a previous commit but it is used in
the wrapper helper. Add it back.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In both cases the proxy is called to get data_dictionary from. Now it's
available as the call argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method needs data_dictionary to work. Fortunately, all callers of
it already have the dictionary at hand and can just pass it as argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is calculate_affected_clustering_ranges() with dictionary
arg, the method needs dictionary to call view_info::select_statement()
later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The visitor is a wait-free helper for matches_view_filter() that has the
dictionary as its argument. Later the visitor will pass the dictionary
to view_info::select_statement().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of functions in view.{hh|cc} that don't belong to any
class and perform view-related calculations for view updates. Lots of
them eventually call view_info::select_statement() which will later need
the dictionary.
By now all those methods' callers have data dictionary at hand and can
share it via argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to have the dictionary at places that later wrap calls to
view_info::select_statement(). This graph of calls starts at the only
public view_updates::generate_update() method which, in turn, is called
from view_update_builder that already has data dictionary at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is table with view-update-generator at hand (it calls
mutate_MV on). Builder here is used as a temporary object that is destroyed
once the caller coroutine co_return-s, so keeping the database obtained
from the view-update-generator is safe.
Later the v.u.b. object will propagate its data dictionary down the
callstacks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Caller already has it to call mutate_MV() on. The method in question
will need the generator in one of the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both services carry database which will be used by auxiliary objects
like view_updates, view_update_builder, consumer, etc in next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
abstract_function_selector uses a preallocated vector to store the
arguments to aggregate functions, to prevent an allocation for every row.
Use small_vector to prevent an allocation per query, if the number of
arguments happens to be small.
This isn't expected to make a significant performance difference.
Spans are more flexible and can be constructed from any contiguous
container (such as small_vector), or a subrange of such a container.
This can save allocations, so change the signature to accept a span.
Spans cannot be constructed from std::initializer_list, so one such
call site is changed to construct a span directly from the single
argument.
A span is more generic than a vector, and can be constructed
from any contiguous container (like small_vector), or a subset
of a container.
To support this, helpers in compound.hh need to use make_iterator_range,
since a span doesn't fit the container concept (spans don't own
their contents).
This is needed to make a similar change to function evaluation, as
the token function passes its parameters to from_optional_exploded().
This short PR fixes a bug in SUM() aggregation where if the data contains +Inf and -Inf the returned sum should be NaN but we returned an error instead. This is a recent regression uncovered by a dtest (see issue #13551), but in the first patch we add additional tests in the cql-pytest framework which reproduce this bug and explore various other areas (wrongly) implicated by the failing dtest.
Fixes #13551
Closes #13564
* github.com:scylladb/scylladb:
cql3: allow SUM() aggregation to result in a NaN
test/cql-pytest: add tests for data casts and inf in sums
Using it, the pylib minio code exports the minio address for tests. This
creates unneeded WTFs when running the test over AWS S3, so it's better
to rename the variable not to mention MINIO at all.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Local test.py runs minio with the public 'testbucket' bucket and all
test cases know that. This series adds the ability to run tests over real
S3, so the bucket name should be configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If a multipart upload fails for some reason, the output stream remains
unclosed and the respective assertion masks the original failure.
Fix that by closing the stream in all cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is like ::getenv() but checks that the variable exists and
throws a descriptive exception if it does not. So instead of
fatal error: in "...": std::logic_error: basic_string: construction from null is not valid
one could get something like
fatal error: in "...": std::logic_error: Environment variable ... not set
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Update comments, test names, etc. that still use the old terminology for
permit state names, bringing them up to date with the recent state name changes.
When floating-point data contains +Inf and -Inf, the sum is NaN.
Our SUM() aggregation calculated this sum correctly, but then instead
of returning it, complained that the sum overflowed by narrowing.
This was a false positive: The sum() finalizer wanted to test that no
precision was lost when casting the accumulator to the result type,
so checked that the result before and after the cast are the same.
But specifically for NaN, it is never equal to anything - not even
to itself. This check is wrong for floating point; moreover, it
isn't even necessary when the two types (accumulator type and result
type) are identical, so in this patch we skip it in that case.
Note that in the current code, a different accumulator and result type
is only used in the case of integer types; When accumulating floating
point sums, the same type is used, so the broken check will be avoided.
The test for this issue starts to pass with this patch, so the xfail
tag is removed.
Fixes#13551
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Similar to the storage_service api, print a log message
for admin operations like enabling/disabling auto_compaction,
running major compaction, and setting the table compaction
strategy.
Note that there is overlap in functionality
between the storage_service and the column_family api entry points.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The AWS signature-generating code was moved from alternator some time ago as is. Now it's clear in which places it should be extended to work for the S3 client as well. The enhancements are:
- Support UNSIGNED-PAYLOAD to omit calculating checksums for request body
- Include full URL path into the signature, not just hard-coded "/" string
- Don't check datastamp expiration if not asked for
This is a part of #13493
Closes#13535
* github.com:scylladb/scylladb:
utils/aws: Brush up the aws_sigv4.hh header
utils/aws: Export timepoint formatter
utils/aws: Omit datestamp expiration checks when not needed
utils/aws: Add canonical-uri argument
utils/aws: Support unsigned-payload signatures
By removing unneeded header inclusions, at the cost of a few more forward
declarations and a couple of extra includes in other .cc files.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13552
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:
1. A new test file, test_cast_data.py, for testing data casts (a
CAST (...) as ... in a SELECT), starting with testing casts from
varint to other types.
The test uncovers a lot of interesting cases (it is heavily
commented to explain these cases) but nothing there is wrong
and all tests pass on Scylla.
2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
that this caused #13551. In Cassandra and older Scylla, the sum
returned a NaN. In Scylla today, it generates a misleading
error message.
As usual, the tests were run on both Cassandra (4.1.1) and Scylla.
Refs #13551.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add new columns to the `system.topology` table: `shard_count` and `ignore_msb`. When a node bootstraps or restarts and observes that the values stored in `topology` are different than the local values, it updates them. This is done in the `update_topology_with_local_metadata` function (the 'metadata' here being the two values).
Additional flag persisted in `system.scylla_local` is used to safely avoid performing read barriers when the values didn't change on node restart. A comment in `update_topology_with_local_metadata` explains why this flag is needed.
An example use case where `shard_count` and `ignore_msb` are needed is creating CDC generations.
Fixes: #13508
Closes#13521
* github.com:scylladb/scylladb:
raft topology: update `release_version` in topology on restart
raft topology: store `shard_count` and `ignore_msb` in topology
in this series, we use the <=> operator to replace `big_decimal::compare()` for better readability. also, we replace the chained ternary expression with a more verbose if-else statement for better performance and readability.
Closes#13478
* github.com:scylladb/scylladb:
utils: big_decimal: replace compare() with <=> operator
utils: big_decimal: optimize big_decimal::compare()
Such namespace-wide imports can create conflicts between names that
are the same in seastar and std, such as {std,seastar}::future and
{std,seastar}::format, since we also have 'using namespace seastar'.
Replace the namespace imports with explicit qualification, or with
specific name imports.
Closes#13528
This is a part of PR #13493 that contains found fixes for the client code itself. The original PR has some questions to resolve, so it's worth merging the fixes separately.
Closes#13534
* github.com:scylladb/scylladb:
s3/client: Add comments about multipart upload completion message
s3/client: Fix succeeded/failed part upload final checking
s3/client: Fix parts to start from 1
Fixes https://github.com/scylladb/scylla-doc-issues/issues/823.
This PR extends the note on the Tracing page to explain what is meant by setting the RF to ALL and adds a link for reference.
Closes#12418
* github.com:scylladb/scylladb:
docs: add an explanation to recommendation in the Note box
doc: extend the information about the recommended RF on the Tracing page
Check on node start if local value of `release_version` changed. If
it did, update it in `system.topology` like we do with `shard_count` and
`ignore_msb`.
Add new columns to the `system.topology` table: `shard_count` and
`ignore_msb`. When a node bootstraps or restarts and observes that the
values stored in `topology` are different than the local values, it
updates them. This is done in the `update_topology_with_local_metadata`
function (the 'metadata' here being the two values).
Additional flag persisted in `system.scylla_local` is used to safely
avoid performing read barriers when the values didn't change on node
restart. A comment in `update_topology_with_local_metadata` explains why
this flag is needed.
An example use case where `shard_count` and `ignore_msb` are needed is
creating CDC generations.
Fixes: #13508
Add the missing pragma-once directive.
Remove the hashers.hh inclusion. It was carried over when the whole code
was detached from alternator (f5de0582c8), but it is not needed
in the header, only in the .cc file which uses sha256_hasher.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The timestamp format for AWS requests is defined in the documentation,
and there's already code that prepares it in this form. This patch
exports that method so that the S3 client can use it in next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The signing code is used in two ways -- by alternator to verify the
arrived signed request and by S3 client to prepare the signed request.
In the former case a date expiration check is performed, but for the
latter this is not required, because the date stamp is most likely now (or
close to it).
So this patch makes the orig_datestamp argument optional, meaning that
expiration checks can be omitted.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current signing code hard-codes "/" as the URL; likely this just
works for alternator. For the S3 client the URL would include the bucket and
object name and should thus become an argument, not a constant.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For S3, signing the whole request payload can be too resource-consuming.
Fortunately, payload signing is only enforced when used with plain http,
but with real S3 we're going to use signed requests over https only (see
the next patch for why).
That said, the patch turns body-content into an optional reference (i.e. --
a pointer) so that the signing code can inject the UNSIGNED-PAYLOAD
mark instead of the payload signature and omit heavy payload signing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The message length is pre-calculated to provide a correct
content-length request header. This math is not obvious and deserves a
comment.
The final message preparation code is also implicitly checking
whether any part failed to upload. There's a comment in the upload_sink's
upload_part() method about it, but the finalization place deserves one
too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When all part uploads complete, the final message is prepared and sent
out to the server. The preparation code is also responsible for checking
that all parts uploaded OK by checking that each part's etag is non-empty. In
that check a misprint crept in -- the whole list is checked for emptiness,
not the individual etag itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, if index_node throws when trying to
add an already indexed node, pop_node might
unindex the existing node instead of the new one.
Instead, with this change, unindex_node looks up
the node by its pointer and removes it from the
index map only if it's found there, so as to clean up
safely after index_node throws (at any stage).
Add a unit test to verify that.
In addition, added a unit test to reproduce #13502 and test the fix.
Closes#13512
* github.com:scylladb/scylladb:
test: locator_topology: add test_update_node
topology: add_node, unindex_node: make exception safe
The docs say that part numbers should start from 1, while the code follows
tradition and starts from 0. Minio is conveniently incompatible in
this sense, so the test had been passing so far. On real S3, part number 0
results in a failed request.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print the following classes without the help of `operator<<`.
- partition_key_view
- partition_key
- partition_key::with_schema_wrapper
- key_with_schema
- clustering_key_prefix
- clustering_key_prefix::with_schema_wrapper
the corresponding `operator<<()`s are dropped in this change,
as all their callers are now using fmtlib for formatting. the helper
of `print_key()` is removed, as its only caller is
`operator<<(std::ostream&, const
clustering_key_prefix::with_schema_wrapper&)`.
the reason why all these operators are replaced in one go is that
we have a template function of `key_to_str()` in `db/large_data_handler.cc`.
this template function is actually the caller of operator<< of
`partition_key::with_schema_wrapper` and
`clustering_key_prefix::with_schema_wrapper`.
so, in order to drop either of these two operator<<, we need to remove
both of them, so that we can switch over to `fmt::to_string()` in this
template function.
Refs scylladb#13245
Closes#13513
* github.com:scylladb/scylladb:
keys: consolidate the formatter for partition_keys
keys: specialize fmt::formatter<partition_key> and friends
`seastar::current_backtrace()` can be quite heavy.
When we pass it to a log message in relatively detailed log_level
(debug/trace), we pay the price of `current_backtrace` every time,
but we rarely print the message.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`seastar::current_backtrace()` can be quite heavy.
When we pass it to a log message in relatively detailed log_level
(debug/trace), we pay the price of `current_backtrace` every time,
but we rarely print the message.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
now that we are using C++20, it'd be more convenient to use
the <=> operator for comparing. the compiler derives the other
comparison operators for us if the <=> operator is defined, so the
code is more compact.
in this change, `big_decimal::compare()` is replaced with `operator<=>`,
and its caller is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, in the worst case the underlying
`number::compare()` got called twice, as it is used by Boost.Multiprecision
to implement the comparison operators of `number`. but since we can
have the result in one go, there is no need to perform the
comparison multiple times.
so, in this change, we just call `number::compare()` explicitly,
and use it to implement `compare()`. this should save a call of
`number::compare()`. also, the chained ternary expression is
replaced with an if-else statement for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The names of these states have been the source of confusion ever since they were introduced. Give them names which better reflect their true meaning and give less room for misinterpretation. The changes are:
* active/unused -> active
* active/used -> active/need_cpu
* active/blocked -> active/await
Hopefully the new names do a better job at conveying what these states really mean:
* active - a regular admitted permit, which is active (as opposed to an inactive permit).
* active/need_cpu - an active permit which was marked as needing CPU for the read to make progress. This permit prevents admission of new permits while it is in this state.
* active/await - a former active/need_cpu permit, which has to wait on I/O or a remote shard. While in this state, it doesn't block the admission of new permits (pending other criteria such as resource availability).
Closes#13482
* github.com:scylladb/scylladb:
docs/dev/reader-concurrency-semaphore.md: expand on how the semaphore works
reader_permit: give better names to active* states
Lately we have observed that some builds are missing the package_url in the build metadata. This is usually caused by changes in how build metadata is stored on the servers, with the s3 reloc server failing to dig it out of the metadata files. A user can usually still obtain the package url, but currently there is no way to plug a user-obtained scylla package into the script's workflow.
This PR fixes this by allowing the user to provide the package as `$ARTIFACT_DIR/scylla.package` (in unpacked form).
Closes#13519
* github.com:scylladb/scylladb:
scripts/open-coredump.sh: allow bypassing the package downloading
scripts/open-coredump.sh: check presence of mandatory field in build json object
scripts/open-coredump.sh: more consistent error messaging
Currently, if index_node throws when trying to
add an already indexed node, pop_node might
unindex the existing node instead of the new one.
Instead, with this change, unindex_node looks up
the node by its pointer and removes it from the
index map only if it's found there, so as to clean up
safely after index_node throws (at any stage).
Add a unit test to verify that.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scylla-sstable currently has two ways to obtain the schema:
* via a `schema.cql` file.
* load schema definition from memory (only works for system tables).
This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible: the sstable can be inspected anywhere, it doesn't have to be on the same host where it originates from. Yet in many cases the sstable is inspected on the same host where it originates from. In these cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool checks the usual locations of the scylla data dir and the scylla configuration file, and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.
This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag; the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow, as this tool is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.
Example:
```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```
As seen above, subdirectories like quarantine, staging etc. are also supported.
Fixes: https://github.com/scylladb/scylladb/issues/10126
Closes#13448
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add tests for schema loading
test/cql-pytest: add no_autocompaction_context
docs: scylla-sstable.rst: remove accidentally added copy-pasta
docs: scylla-sstable.rst: remove paragraph with schema limitations
docs: scylla-sstable.rst: update schema section
test/cql-pytest: nodetool.py: add flush_keyspace()
tools/scylla-sstable: reform schema loading mechanism
tools/schema_loader: add load_schema_from_schema_tables()
db/schema_tables: expose types schema
Greatly expand on the details of how the semaphore works.
Organize the content into thematic chapters to improve navigation.
Improve formatting while at it.
The names of these states have been the source of confusion ever since
they were introduced. Give them names which better reflect their true
meaning and give less room for misinterpretation. The changes are:
* active/unused -> active
* active/used -> active/need_cpu
* active/blocked -> active/await
Hopefully the new names do a better job at conveying what these states
really mean:
* active - a regular admitted permit, which is active (as opposed to
an inactive permit).
* active/need_cpu - an active permit which was marked as needing CPU for
the read to make progress. This permit prevents admission of new
permits while it is in this state.
* active/await - a former active/need_cpu permit, which has to wait on
I/O or a remote shard. While in this state, it doesn't block the
admission of new permits (pending other criteria such as resource
availability).
By allowing the user to plug in a manually downloaded package. Consequently
the "package_url" field of the build metadata is checked only if there
is no user-provided extracted package.
This allows working around builds for which the metadata server returns
no "package_url", by allowing the user to locate and download the
package themselves, providing it to the script by simply extracting it
as $ARTIFACT_DIR/scylla.package.
Reproducers for https://github.com/scylladb/scylladb/issues/10770.
(Already fixed in 15ebd59071)
Includes necessary improvements and fixes to `pylib`.
Closes#12699
* github.com:scylladb/scylladb:
test/pytest: reproducers for store mutation...
test: pylib: Add a way to create cql connections with particular coordinators
test/pylib: get gossiper alive endpoints
test/topology: default replication factor 3
test/pylib: configurable replication factor
Mandatory fields missing from the build json object lead to obscure,
unrelated error messages down the road. Avoid this by checking that all
required fields are present and printing an error message if any is
missing.
before this change, we just print out the addresses of the elements
in `column_defs` if the arguments passed to the `token()` function are
not valid. this is not quite helpful from the user's perspective, as
the user would be more interested in the values. also, we could print
a more accurate error message for each different error.
in this change, following Cassandra 4.1's behavior, three cases are
identified, and corresponding errors are returned respectively:
* duplicated partition keys
* wrong order of partition keys
* missing keys
where, if the partition key order is wrong, instead of printing the
keys specified by the user, the correct order is printed in the error
message to help the user correct the `token()` call.
for better performance, the checks are performed only if the keys
do not match, based on the assumption that the error handling path
is not likely to be executed.
tests are added accordingly. they were also tested against Cassandra 4.1.1.
Fixes#13468
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13470
A description of storage options is important for S3, as one
needs to know whether the underlying storage is local or
remote, and if the latter, the details about it.
This relies on the server-side describe statement.
$ ./bin/cqlsh.py -e "describe keyspace1;"
CREATE KEYSPACE keyspace1 WITH replication = { ... } AND
storage = {'type': 'S3', 'bucket': 'sstables',
'endpoint': '127.0.0.1:9000'} AND
durable_writes = true;
Fixes#13507.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13510
There's a legacy safety check in view code that needs to find a base table from its schema ID. To do it, it calls into the global storage proxy instance. The comment says that this code can be removed once the computes_column feature is known by everyone. I'm not sure if that's the case, so here's a more complicated yet less incompatible way to stop using the global proxy instance.
Closes#13504
* github.com:scylladb/scylladb:
view: Remove unused view_ptr reference
view: Carry backing-secondary-index bit via view builder
view: Keep backing-seconday-index bool on value_getter
table: Add const index manager sgetter
in dcce0c96a9, we should have used an
f-string for printing the return code of the gzip subprocess, but the
"f" prefix was missing. so, in this change, it is added.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13500
enable_optimized_twcs_queries is specific to TWCS, therefore it
belongs to TWCS, not replica::table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13489
Related: https://github.com/scylladb/scylla-enterprise/issues/2794
This commit adds the information about the metric changes
in version 2023.1 compared to version 5.2.
This commit is part of the 5.2-to-2023.1 upgrade guide and
must be backported to branch-5.2.
Closes#13506
since there are two places formatting `with_schema_wrapper`, it'd
be desirable to consolidate them. so, in this change, the
formatting code is extracted into a helper, so we only have a single
place for formatting the `with_schema_wrapper`s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print the following classes without the help of `operator<<`.
- partition_key_view
- partition_key
- partition_key::with_schema_wrapper
- key_with_schema
- clustering_key_prefix
- clustering_key_prefix::with_schema_wrapper
the corresponding `operator<<()`s are dropped in this change,
as all their callers are now using fmtlib for formatting. the helper
of `print_key()` is removed, as its only caller is
`operator<<(std::ostream&, const
clustering_key_prefix::with_schema_wrapper&)`.
the reason why all these operators are replaced in one go is that
we have a template function of `key_to_str()` in `db/large_data_handler.cc`.
this template function is actually the caller of operator<< of
`partition_key::with_schema_wrapper` and
`clustering_key_prefix::with_schema_wrapper`.
so, in order to drop either of these two operator<<, we need to remove
both of them, so that we can switch over to `fmt::to_string()` in this
template function.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Load-and-stream reads the entire content from SSTables, therefore it can
afford to discard the bloom filter that might otherwise consume a significant
amount of memory. Bloom filters are only needed by compaction and other
replica::table operations that might want to check the presence of keys
in the SSTable files, like single-partition reads.
It's not uncommon to see Data:Filter ratio of less than 100:1, meaning
that for ~300G of data, filters will take ~3G.
In addition to saving memory footprint, it also reduces operation time
as load-and-stream no longer has to read, parse and build the filters
from disk into memory.
Closes#13486
* github.com:scylladb/scylladb:
sstable_loader: Discard SSTable bloom filter on load-and-stream
sstables: Allow SSTable loading to discard bloom filter
sstables: Allow sstable_directory user to feed custom sstable open config
sstables: Move sstable_open_info into open_info.hh
with schema change and host down
Reproducers for a failure during lwt operation due to missing of a
column mapping in schema history table.
Issue #10770
For most tests there will be nodes down, increase replication factor to
3 to avoid having problems for partitions belonging to down nodes.
Use replication factor 1 for raft upgrade tests.
Currently, opt_st overwrites the internal `changed` flag
with the opt_st changed status.
Instead, it should use `|=` so the flag stays true if it is already set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13502
Load-and-stream reads the entire content from SSTables, therefore it can
afford to discard the bloom filter that might otherwise consume a significant
amount of memory. Bloom filters are only needed by compaction and other
replica::table operations that might want to check the presence of keys
in the SSTable files, like single-partition reads.
It's not uncommon to see Data:Filter ratio of less than 100:1, meaning
that for ~300G of data, filters will take ~3G.
In addition to saving memory footprint, it also reduces operation time
as load-and-stream no longer has to read, parse and build the filters
from disk into memory.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If the bloom filter is not loaded, an always-present filter
is used instead, which translates into the SSTable being opened on every
single read.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When the view builder is constructed, it populates itself with view updates.
Later the updates may instantiate the value_getter-s which, in turn,
would need to check if the view is backing a secondary index.
The good news is that when the view builder is constructed it has all the
information at hand needed to evaluate this "backing" bit. It's then
propagated down to value_getter via the corresponding view_updates.
The getter's _view field becomes unused after this change and is
(void)-ed to make this patch compile.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The getter needs to check if the view is backing a secondary index.
Currently it's done inside the handle_computed_column() method, but it's
more convenient if this bit is known during construction, so move it
there. There are no places that can change this property between the time
view_getter is created and the method in question is called.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated and requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit test is also added, reproducing the overly-aggressive eviction and
checking that it doesn't happen anymore.
Fixes: #11803
Closes#13286
It makes the compiler complain about mis-ordered initialization of st_nlink
vs st_mode on different arches. The current code (st_nlink before st_mode)
compiled fine on x86, but fails on ARM, which wants st_mode to come
before st_nlink. Changing the order would, apparently, break the x86 build
with a similar message.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13499
* seastar 1204efbc...ed7a0f54 (46):
> gate: s/intgernal/internal/
> reactor: set reactor::_stopping to true on all shards
> condition-variable: replace the coroutine wakeup task with a promise
> tutorial: explain the buffer_size_t param of generator coroutine
> log: call log_level_map explicitly in constructor
> future: de-variadicate make_ready_future() and similar helpers
> timer-set,scollectd: remove unnecessary ";"
> util/conversion: remove inclusion guards
> foreign_ptr: destroy: use run_in_background
> abort_source, abortable_fifo: use is_nothrow_invocable_r_v<>
> alien: add type constraint for alien::run_on and alien::submit_to
> alien: add noexcept specifier for lambda passed to run_on()
> test: alien_test: test alien::run_on() also
> test: alien_test: throw if unexpected things happens
> future: make API level 6 mandatory
> api-level: update IDE fallback
> core/on_internal_error: always log error with backtrace
> future: make API level 5 mandatory
> websocket: fix frame parsing.
> websocket: fix frame assembling.
> when_all: drop code for API_LEVEL < 4
> future: drop internal call_then_impl
> future: when_all_succeed(): make API level 4 mandatory
> reactor: trade comment for type constraints
> sstring: s/is_invocable_r/is_invocable_r_v/
> doc: compatibility: document API levels 5 and 6
> demos: file_demo: pass a string_view to_open_file_dma()
> TLS: Add issuer/subject info to verification error message
> test: fstream_test: drop unnecessary API_LEVEL check
> manual_clock: advance: use run_in_background to expire_timers
> reactor: add run_in_backround and close
> websocket: shutdown input first.
> websocket: use gate to guard background tasks.
> websocket: remove trailing spaces.
> websocket_demo: ignore sleep_aborted exception.
> websocket_demo: fix coredump.
> fstream: drop API level 2 (make_file_output_stream() returning non-future)
> core/sstring: do not use ostream_formatter
> metrics: use fmt::to_string() when creating a label
> backtrace: fix size calculation in dl_iterate_phdr
> Downgrade expected stall detector warning to info
> fix: Add missing inline code blocks
> spawn_test: fix /bin/cat stuck in reading input.
> reactor: pass fd opened in blocking mode to spawned process
> reactor: skip sigaction if handler has been registered before.
> reactor: allow registering handler multiple times for a signal.
the substvar `${shlibs:Depends}` is set by dh_shlibdeps, which
inspects the ELF images being packaged to figure out the shared
library dependencies for packages. but since f3c3b9183c,
we override the `override_dh_shlibdeps` target in debian/rules
with a no-op, as we take care of the shared library dependencies
ourselves by vendoring the runtime dependencies into the relocatable
package. so this variable is never set. that's why `dpkg-gencontrol`
complains when processing `debian/control` and trying to materialize
the substvars.
in this change, the occurrences of `${shlibs:Depends}` are removed
to silence the warnings from `dpkg-gencontrol`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13457
Commit 49892a0, back in 2018, bumps the compaction shares by 200 to
guarantee a minimum base line.
However, after commit e3f561d, major compaction runs in the maintenance
group, meaning that bumping shares became completely irrelevant and
only causes regular compaction to be unnecessarily more aggressive.
Fixes#13487.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13488
Currently, token_metadata_impl maintains a "shadow" endpoint to host_id map on top of the maps in topology.
This series first reimplements the functions that currently use this map to use topology instead.
Then the important users of `get_endpoint_to_host_id_map_for_reading`, node_ops_ctl and view_builder,
are converted to use a new `topology::for_each_node` function to process all nodes in topology directly, without going through `get_endpoint_to_host_id_map_for_reading`.
Closes#13476
* github.com:scylladb/scylladb:
view_builder: view_build_statuses: use topology::for_each_node
storage_service: node_ops_ctl: refresh_sync_nodes: use topology::for_each_node
topology: add for_each_node
token_metadata: get endpoint to node map from topology
A set of comprehensive tests covering all the supported ways of providing
the schema to scylla-sstable, either explicitly or implicitly
(auto-detect).
The above file contained a paragraph explaining the limitations of
`scylla-sstable.rst` w.r.t. automatically finding the schema. This no
longer applies, so remove it.
It would have been better if `flush()` could have been called with a
keyspace and optional table param, but changing it now is too much
churn, so we add a dedicated method to flush a keyspace instead.
So far, the schema had to be provided via a schema.cql file, a file which
contains the CQL definition of the table. This is flexible but annoying
at the same time. Many times, the sstables the tool operates on are located
in their table directory in a scylla data directory, where the schema
tables are also available. To mitigate this, an alternative method to
load the schema from memory was added, which works for system tables.
In this commit we extend this to work for all kinds of tables: by
auto-detecting where the scylla data directory is, and loading the
schema tables from disk.
Allows loading the schema for the designated keyspace and table from
the system table sstables located on disk. The sstable files are opened
read-only.
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `tombstone` and `shadowable_tombstone` without the help of fmt::ostream. their `operator<<(ostream,..)` overloads are dropped, as there are no users of them anymore.
Refs #13245
Closes#13474
* github.com:scylladb/scylladb:
mutation: specialize fmt::formatter<tombstone> and fmt::formatter<shadowable_tombstone>
utils: specialize fmt::formatter<optional<>>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `big_decimal` without the help of `operator<<`. this operator
is dropped in this change, as all its callers now use fmtlib
for formatting. we might need to use fmtlib to implement `big_decimal::to_string()`,
and use `fmt::to_string()` instead, but let's leave it for a follow-up
change.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13479
Task manager task implementations of classes that cover
rewrite-sstables keyspace compaction, which can be started
through the /storage_service/keyspace_compaction/ API.
Top level task covers the whole compaction and creates child
tasks on each shard.
Closes#12714
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test rewrite sstables compaction
compaction: create task manager's task for rewrite sstables keyspace compaction on one shard
compaction: create task manager's task for rewrite sstables keyspace compaction
compaction: create rewrite_sstables_compaction_task_impl
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `tombstone` and `shadowable_tombstone` without the
help of `operator<<`.
in this change, only `operator<<(ostream&, const shadowable_tombstone&)`
is dropped, and all its callers now use fmtlib for formatting
instances of `shadowable_tombstone`.
`operator<<(ostream&, const tombstone&)` is preserved, as it is still
used by Boost.Test for printing the operands in case the comparison tests
fail.
please note, before this change we used a literal string
for indentation; after this change, some of the places use fmtlib
for indentation instead.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `optional<T>` without the help of `operator<<()`.
this change also enables us to ditch more `operator<<()`s in the future:
we currently rely on `operator<<(ostream&, const optional<T>&)` for
printing instances of `optional<T>`, and `operator<<(ostream&, const optional<T>&)`
in turn uses `operator<<(ostream&, const T&)`. so, the new
specialization of `fmt::formatter<optional<>>` removes yet
another caller of these operators.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print classes fulfilling the `FragmentedView` concept
without the help of the `to_hex()` template function. this function is
dropped in this change, as all its callers now use fmtlib
for formatting. the `fragment_to_hex()` helper is dropped
as well, as its only caller was `to_hex()`.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13471
Don't maintain a "shadow" endpoint_to_host_id_map in token_metadata_impl.
Instead, get the nodes_by_endpoint map from topology
and use it to build the endpoint_to_host_id_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series extends sstable cleanup to resharding and other (offstrategy, major, and regular) compaction types so as to:
* cleanup uploaded sstables (#11933)
* cleanup staging sstables after they are moved back to the main directory and become eligible for compaction (#9559)
When perform_cleanup is called, all sstables are scanned; those that require cleanup are marked as such and added for tracking to table_state::cleanup_sstable_set. They are removed from that set once released by compaction.
Along with that sstables set, we keep the owned_ranges_ptr used by cleanup in the table_state to allow other compaction types (offstrategy, major, or regular) to cleanup those sstables that are marked as require_cleanup and that were skipped by cleanup compaction for either being in the maintenance set (requiring offstrategy compaction) or in staging.
Resharding uses a more straightforward mechanism: the owned token ranges are passed when resharding uploaded sstables and used to detect sstables that require cleanup, with the cleanup piggybacked on the resharding compaction.
Closes#12422
* github.com:scylladb/scylladb:
table: discard_sstables: update_sstable_cleanup_state when deleting sstables
compaction_manager: compact_sstables: retrieve owned ranges if required
sstables: add a printer for shared_sstable
compaction_manager: keep owned_ranges_ptr in compaction_state
compaction_manager: perform_cleanup: keep sstables in compaction_state::sstables_requiring_cleanup
compaction: refactor compaction_state out of compaction_manager
compaction: refactor compaction_fwd.hh out of compaction_descriptor.hh
compaction_manager: compacting_sstable_registration: keep a ref to the compaction_state
compaction_manager: refactor get_candidates
compaction_manager: get_candidates: mark as const
table, compaction_manager: add requires_cleanup
sstable_set: add for_each_sstable_until
distributed_loader: reshard: update sstable cleanup state
table, compaction_manager: add update_sstable_cleanup_state
compaction_manager: needs_cleanup: delete unused schema param
compaction_manager: perform_cleanup: disallow empty sorted_owned_ranges
distributed_loader: reshard: consider sstables for cleanup
distributed_loader: process_upload_dir: pass owned_ranges_ptr to reshard
distributed_loader: reshard: add optional owned_ranges_ptr param
distributed_loader: reshard: get a ref to table_state
distributed_loader: reshard: capture creator by ref
distributed_loader: reshard: reserve num_jobs buckets
compaction: move owned ranges filtering to base class
compaction: move owned_ranges into descriptor
before this change, we don't error out even if pigz fails. but
there is a chance that pigz fails to create the gzip'ed relocatable
tarball, either due to environmental issues or some other problems,
and we would not be aware of this until a packaging script like
`reloc/build_rpm.sh` tries to ungzip the corrupted gzip file.
in this change, if pigz's exit status is not 0, the status code
is printed, and create-relocatable-package.py returns 1.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13459
This series adds an option to read the relabel config from a file.
Most of Scylla's metrics are reported per shard; sometimes they are also reported per scheduling
group or per table. With modern hardware, this can quickly grow to a large number of metrics that
overload Scylla and the collecting server.
One of the main issues around metrics reduction is that many of the metrics are only
helpful in certain situations.
For example, Scylla monitoring only looks at a subset of the metrics, so in large deployments
it would be helpful to scrape only those.
One option would be to mark all dashboard-related metrics with a label value, and then have Prometheus
request only metrics with that label value.
There are two main limitations to scraping by label values:
1. some of the metrics we want to report are in seastar, so we'll need to label them somehow (we cannot just add random labels to seastar metrics)
2. things change, new metrics are introduced and we may want them; it's not practical to re-compile and wait
for a new release whenever we want to change a label just for monitoring.
It would be best to have the option to add metrics freely and choose at runtime what to report.
This series makes use of the Seastar API to perform metrics manipulation dynamically. It includes adding, removing, and changing labels, and also enabling and disabling metrics and the skip_when_empty option.
After this series the configuration could be used with:
```--relabel-config-file conf.yaml```
The general logic and format follows Prometheus metrics_relabel_config configuration.
Where the configuration file looks like:
```
$ cat conf.yaml
relabel_configs:
- source_labels: [shard]
action: drop
target_label: shard
regex: (2)
- source_labels: [shard]
action: replace
target_label: level
replacement: $1
regex: (.*3)
```
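Conceptually, each rule in a file like the one above is applied to a metric's labels roughly as in the sketch below. The names (`relabel_rule`, `apply_rule`) are hypothetical, chosen only to illustrate the drop/replace semantics; the real implementation lives in the Seastar metrics layer.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Toy model of one entry from relabel_configs (illustrative, not Scylla's code).
struct relabel_rule {
    std::string action;      // "drop" or "replace"
    std::regex regex;        // matched against the source label value
    std::string replacement; // used by "replace"; may reference $1 captures
};

// Returns false if the metric should be dropped; for "replace", rewrites
// the target label value in place when the regex matches.
bool apply_rule(const relabel_rule& rule, const std::string& source_value,
                std::string& target_value) {
    std::smatch m;
    if (!std::regex_match(source_value, m, rule.regex)) {
        return true; // rule doesn't apply to this metric
    }
    if (rule.action == "drop") {
        return false;
    }
    if (rule.action == "replace") {
        target_value = m.format(rule.replacement); // expands $1 etc.
    }
    return true;
}
```

With the example config, a metric whose `shard` label is `2` is dropped, and a metric whose `shard` matches `(.*3)` gets its `level` label set to the captured value.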
Closes#12687
* github.com:scylladb/scylladb:
main: Load metrics relabel config from a file if it exists
Add relabel from file support.
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.
The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.
Fixes#6389.
Closes#13150
* github.com:scylladb/scylladb:
test/alternator: test concurrent TagResource / UntagResource
db/tags: drop unsafe update_tags() utility function
alternator: isolate concurrent modification to tags
db/tags: add safe modify_tags() utility functions
migration_manager: expose access to storage_proxy
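The essence of the fix above is turning a separate read-then-write into a single read-modify-write critical section. A toy sketch of the idea (hypothetical `tag_store` type, not Alternator's actual code, which serializes via a schema-change lock rather than a plain mutex):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>

// Illustrative model: modify_tags() applies the caller's mutation to the
// tag map while holding one lock, so concurrent modifications cannot
// overwrite each other the way separate read+write calls could.
class tag_store {
    std::mutex _lock;
    std::map<std::string, std::string> _tags;
public:
    using tag_map = std::map<std::string, std::string>;

    // Read-modify-write as one critical section.
    void modify_tags(const std::function<void(tag_map&)>& f) {
        std::lock_guard<std::mutex> g(_lock);
        f(_tags);
    }

    tag_map get_tags() {
        std::lock_guard<std::mutex> g(_lock);
        return _tags;
    }
};
```

Two concurrent `modify_tags()` calls each see the other's update, whereas two unsynchronized "read all tags, write all tags" sequences could lose one of the modifications.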
This is a translation of Cassandra's CQL unit test source file
validation/operations/DeleteTest.java into our cql-pytest framework.
There are 51 tests, and they did not reproduce any previously-unknown
bug, but did provide additional reproducers for three known issues:
Refs #4244 Add support for mixing token, multi- and single-column
restrictions
Refs #12474 DELETE prints misleading error message suggesting ALLOW
FILTERING would work
Refs #13250 one-element multi-column restriction should be handled like
a single-column restriction
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13436
SSTable summary is one of the components fully loaded into memory, and it may have a significant footprint.
This series reduces the summary footprint by reducing the amount of token information that we need to keep
in memory for each summary entry.
Of course, the benefit of this size optimization is proportional to the amount of summary entries, which
in turn is proportional to the number of partitions in an SSTable.
Therefore we can say that this optimization will most benefit tables which have tons of small
partitions, which result in big summaries.
Results:
```
BEFORE
[1000000 pkeys] data size: 4035888890, summary -> memory footprint: 5843232, entries: 88158
[10000000 pkeys] data size: 40368888890, summary -> memory footprint: 55787128, entries: 844925
AFTER
[1000000 pkeys] data size: 4035888890, summary -> memory footprint: 4351536, entries: 88158
[10000000 pkeys] data size: 40368888890, summary -> memory footprint: 42211984, entries: 844925
```
That shows a 25% reduction in footprint, for both 1 and 10 million pkeys.
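A back-of-the-envelope sketch of where the per-entry saving comes from: `dht::token` carries a kind field alongside its 64-bit data, and dropping it saves 8 bytes per entry. The struct layouts below are illustrative stand-ins, not Scylla's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical simplified layouts (names are illustrative only).
enum class token_kind : uint64_t { before_all_keys, key, after_all_keys };

struct token_with_kind {
    token_kind kind; // 8 bytes
    int64_t data;    // 8 bytes
};

struct summary_entry_before {
    token_with_kind token; // 16 bytes: kind + data
    const char* key_ptr;   // view into the summary memory pool
    size_t key_len;
    uint64_t position;
};

struct summary_entry_after {
    int64_t raw_token;     // 8 bytes: all sampled tokens are of kind 'key'
    const char* key_ptr;
    size_t key_len;
    uint64_t position;
};

constexpr size_t per_entry_saving =
    sizeof(summary_entry_before) - sizeof(summary_entry_after);
```

With hundreds of thousands of entries, 8 bytes per entry adds up to the megabytes of savings shown in the numbers above.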
Closes#13447
* github.com:scylladb/scylladb:
sstables: Store raw token into summary entries
sstables: Don't store token data into summary's memory pool
The PR adds an sstables storage backend that keeps all component files as S3 objects, and a system.sstables_registry ownership table that keeps track of which sstable objects belong to the local node and their names.
When a keyspace is configured with 'STORAGE = { 'type': 'S3' }', the respective `table` object eventually gets the storage_options instance pointing to the target S3 endpoint and bucket. All the sstables created for that table attach the S3 storage implementation that maintains components' files as S3 objects. Writing to and reading from components is handled by the S3 client facilities from utils/. Changing the sstable state -- moving between normal, staging and quarantine states -- is not yet implemented, but would eventually happen by updating entries in the sstables registry.
To keep track of which node owns which objects, to provide bucket-wide uniqueness of object names, and to maintain sstable state, the storage driver keeps records in the system.sstables_registry ownership table. The table maps sstable location and generation to the object format, version, status-state (*) and (!) unique identifier (some time soon this identifier is supposed to be replaced with UUID sstable generations). The component object name is thus s3://bucket/uuid/component_basename. The registry is also used on boot. The distributed loader picks up sstables from all the tables found in the schema, and for S3-backed keyspaces it lists entries in the registry to a) identify those and b) get their unique S3-side identifiers to open by name.
(*) About sstable's status and state.
The state field is the part of today's sstable path on disk -- staging, quarantine, normal (root table data dir), etc. Since S3 doesn't have the renaming facility, moving sstable between those states is only possible by updating the entry in the registry. This is not yet implemented in this set (#13017)
The status field tracks the sstable's transition through its creation-deletion lifecycle. It first starts with the 'creating' status, which corresponds to today's TemporaryTOC file. After being created and written to, the sstable moves into the 'sealed' state, which corresponds to today's normal sstable with its TOC file. To delete an sstable atomically, it first moves into the 'removing' state, which is equivalent to being in the deletion log for an on-disk sstable. Once removed from the bucket, the entry is removed from the registry.
To play with:
1. Start minio (installed by install-dependencies.sh)
```
export MINIO_ROOT_USER=${root_user}
export MINIO_ROOT_PASSWORD=${root_pass}
mkdir -p ${root_directory}
minio server ${root_directory}
```
2. Configure minio CLI, create anonymous bucket
```
mc config host rm local
mc config host add local http://127.0.0.1:9000 ${root_user} ${root_pass}
mc mb local/sstables
mc anonymous set public local/sstables
```
3. Start Scylla with object-storage feature enabled
``` scylla ... --experimental-features=keyspace-storage-options --workdir ${as_usual}```
4. Create KS with S3 storage
``` create keyspace ... storage = { 'type': 'S3', 'endpoint': '127.0.0.1:9000', 'bucket': 'sstables' };```
The S3 client has a logger named "s3"; it's useful to turn it on with `trace` verbosity.
Closes#12523
* github.com:scylladb/scylladb:
test: Add object-storage test
distributed_loader: Print storage type when populating
sstable_directory: Add ownership table components lister
sstable_directory: Make components_lister an API
sstable_directory: Create components lister based on storage options
sstables: Add S3 storage implementation
system_keyspace: Add ownership table
system_keyspace: Plug to user sstables manager too
sstable: Make storage instance based on storage options
sstable_directory: Keep storage_options aboard
sstable: Virtualize the helper that gets on-disk stats for sstable
sstable, storage: Virtualize data sink making for small components
sstable, storage: Virtualize data sink making for Data and Index
sstable/writer: Shuffle writer::init_file_writers()
sstable: Make storage an API
utils: Add S3 readable file impl for random reads
utils: Add S3 data sink for multipart upload
utils: Add S3 client with basic ops
cql-pytest: Add option to run scylla over a stable directory
test.py: Equip it with minio server
sstables: Detach write_toc() helper
We need to remove the deleted sstables from
update_sstable_cleanup_state otherwise their data and index
files will remain opened and their storage space won't be reclaimed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If any of the sstables to-be-compacted requires cleanup,
retrieve the owned_ranges_ptr from the table_state.
With that, staging sstables will eventually be cleaned up
via regular compaction.
Refs #9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the printing logic in compaction::formatted_sstables_list
out to sstables::to_string(const shared_sstable&, bool include_origin)
and operator<<(const shared_sstable) on top of it.
So that we can easily print std::vector<shared_sstable>
from compaction_manager in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When perform_cleanup adds sstables to sstables_requiring_cleanup,
also save the owned_ranges_ptr in the compaction_state so
it could be used by other compaction types like
regular, reshape, or major compaction.
When the exhausted sstables are released, check
if sstables_requiring_cleanup is empty, and if it is,
clear also the owned_ranges_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As a first step towards parallel cleanup by
(regular) compaction and cleanup compaction,
filter all sstables in perform_cleanup
and keep the set of sstables in the compaction_state.
Erase from that set when the sstables are unregistered
from compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So it can be used in the next patch that will refactor
compaction_state out of class compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow getting candidates for compaction
from an arbitrary range of sstables, not only
the in_strategy_sstables.
To be used by perform_cleanup to mark all sstables
that require cleanup, even if they can't be
compacted at this time.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calls a function on all sstables or until the
function returns stop_iteration::yes.
Change the sstable_set_impl interface to expose
only for_each_sstable_until and let
sstable_set::for_each_sstable use that, wrapping
the void-returning function passed to it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
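The shape described above — an interface exposing only the "until" form, with the void-returning variant built on top by wrapping the callback — can be sketched with toy types (the real classes operate on `shared_sstable`, not `int`):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy stand-ins illustrating the interface change, not Scylla's real classes.
enum class stop_iteration { no, yes };

struct sstable_set {
    std::vector<int> sstables; // stand-in for the set of shared_sstable

    // The single primitive the impl interface exposes.
    void for_each_sstable_until(const std::function<stop_iteration(int)>& f) const {
        for (int sst : sstables) {
            if (f(sst) == stop_iteration::yes) {
                return; // stop early, as requested by the callback
            }
        }
    }

    // The void-returning variant wraps its callback and never stops early.
    void for_each_sstable(const std::function<void(int)>& f) const {
        for_each_sstable_until([&f] (int sst) {
            f(sst);
            return stop_iteration::no;
        });
    }
};

// Demo helpers: count visits with and without early stop.
inline int visited_before_stop(const sstable_set& s, int limit) {
    int n = 0;
    s.for_each_sstable_until([&] (int) {
        ++n;
        return n == limit ? stop_iteration::yes : stop_iteration::no;
    });
    return n;
}

inline int visited_all(const sstable_set& s) {
    int n = 0;
    s.for_each_sstable([&] (int) { ++n; });
    return n;
}
```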
Since the sstables are loaded from foreign open info
we should mark them for cleanup if needed (and owned_ranges_ptr is provided).
This will allow a later patch to enable filtering
for cleanup only for sstable sets containing
sstables that require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
update_sstable_cleanup_state calls needs_cleanup and
inserts (or erases) the sstable into the respective
compaction_state.sstables_requiring_cleanup set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
I'm not sure why this was originally supported,
maybe for upgrading sstables where we may want to
rewrite the sstables without filtering any tokens,
but perform_sstable_upgrade now follows a
different code path and uses `rewrite_sstables`
directly, without piggybacking on cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When called from `process_upload_dir` we pass a list
of owned tokens to `reshard`. When they are available,
run resharding, with implicit cleanup, also on unshared
sstables that need cleanup.
Fixes#11933
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that reshard is a coroutine, creator is preserved
in the coroutine frame until completion so we can
simply capture it by reference now.
Note that previously it was moved into the compaction
descriptor, but the capture wasn't mutable so it was
copied anyhow, and this change doesn't introduce a
regression.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the token filtering logic down from cleanup_compaction
to regular_compaction and class compaction so it can be
reused by other compaction types.
Create a _owned_ranges_checker in class compaction
when _owned_ranges is engaged, and use it in
compaction::setup to filter partitions based on the owned ranges.
Ref scylladb/scylladb#12998
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the owned_ranges_ptr, currently used only by
cleanup and upgrade compactions, to the generic
compaction descriptor so we apply cleanup in other
compaction types.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `auth::auth_authentication_options` and `auth::resource_kind`
without the help of fmt::ostream. their `operator<<(ostream,..)` overloads are
dropped, as there are no users of them anymore.
Refs #13245
Closes#13460
* github.com:scylladb/scylladb:
auth: remove unused operator<<(.., resource_kind)
auth: specialize fmt::formatter<resource_kind>
auth: remove unused operator<<(.., authentication_option)
auth: specialize fmt::formatter<authentication_option>
The test does
- starts scylla (over a stable directory)
- creates an S3-backed keyspace (minio is already up and running,
courtesy of test.py)
- creates table in that keyspace and populates it with several rows
- flushes the keyspace to make sstables hit the storage
- checks that the ownership table is populated properly
- restarts scylla
- makes sure old entries exist
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On boot it's very useful to know which storage a table comes from, so
add the respective info to existing log messages.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sstables are stored on object storage, they are "registered" in the
system.sstables_registry ownership table. The sstable_directory is
supposed to list sstables from this table, so here's the respective
components lister.
The lister is created by sstables_manager; by the time it's requested,
the system keyspace is already plugged. The lister only handles
"sealed" sstables. Dangling ones are still ignored; this is to be fixed
later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the lister is filesystem-specific. There will soon come another one
for S3, so the sstable_directory should be prepared for that by making
the lister an abstract class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The directory's lister is storage-specific and should be created
differently for different storage options.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The driver puts all components into
s3://bucket/uuid/component_name
objects where 'bucket' is the keyspace options configuration parameter,
and the 'uuid' is the value obtained from the ownership table.
E.g.
s3://test_bucket/d0a743b0-ad38-11ed-85b5-39b6b0998182/Data.db
The life-time is straightforward. Until sealed, the sstable has the
'creating' status in the table; then it's updated to be 'sealed'. Prior
to removing the objects, the status is set to 'deleting', thus allowing
the distributed loader to pick up the dangling objects on re-load (not
yet implemented). Finally, the entry is deleted from the table.
It needs the PR #12648 not to generate empty ks/cf directories on the
local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema is
```
CREATE TABLE system.sstables (
    location text,
    generation bigint,
    format text,
    status text,
    uuid uuid,
    version text,
    PRIMARY KEY (location, generation)
)
```
A sample entry looks like:
```
 location                                                            | generation | format | status | uuid                                 | version
---------------------------------------------------------------------+------------+--------+--------+--------------------------------------+---------
 /data/object_storage_ks/test_table-d096a1e0ad3811ed85b539b6b0998182 |          2 |    big | sealed | d0a743b0-ad38-11ed-85b5-39b6b0998182 |      me
```
The uuid field points to the "folder" on the storage where the sstable
components are. Like this:
s3
`- test_bucket
`- f7548f00-a64d-11ed-865a-0c1fbc116bb3
`- Data.db
- Index.db
- Filter.db
- ...
It's not very nice that the whole /var/lib/... path is in fact used as
the location; it needs the PR #12707 to fix this place.
Also, the "status" part is not yet fully functional, it only supports
three options:
- creating -- the same as TemporaryTOC file exists on disk
- sealed -- default state
- deleting -- the analogy for the deletion log on disk
The latter needs support from the distributed_loader, which is not yet
there. In fact, distributed_loader also needs to be patched to actually
select entries from this table on load. It also needs the mentioned
PR #12707 to support staging and quarantine sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sharded<sys_ks> instances are plugged to large data handler and
compaction manager to maintain the circular dependency between these
components via the interposing database instance. Do the same for user
sstables manager, because S3 driver will need to update the local
ownership table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds storage options lw-ptr to sstables_manager::make_sstable
and makes the storage instance creation depend on the options. For local
it just creates the filesystem storage instance, for S3 -- throws, but
next patch will fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The class in question will need to know which table storage it lists
sstables from. For that, construct it with the storage options
taken from the table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an existing (or just sealed) sstable its components are
stat()-ed to get the on-disk sizes and a bit more. Stat-ing a file by
name on S3 is not (yet) implemented and doing it file-by-file can be
quite terrible. So add a method to return sstable stats in a
storage-specific manner. For S3 this can be implemented by getting the
info from the ownership table (in the future).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This time the sstable needs to create a data sink for a component without
having the file at hand. That's pretty much the same as in the previous
patch, but the method declaration differs slightly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method needs to create two data sinks -- for the Data and Index
files -- and then wrap them with more stuff (compression, checksums,
streams, etc.). With the S3 backend, using file-output-stream won't
work, because S3 storage cannot provide a writable file API (it has
data_sink instead).
This patch extracts file_data_sink creation so that it could be
virtualized with storage API later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sstable carries a filesystem_storage instance on board. The next
patches will make it possible to use some other storage with different
data accessing methods. This patch makes sstable carry an abstract storage
interface and makes the existing filesystem_storage implement it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Sometimes an sstable is used for random reads, sometimes -- for streamed
reads using the input stream. For both cases the file API can be
provided, because the S3 API allows random reads of arbitrary lengths.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Putting a large object into S3 using a plain PUT is a bad choice -- one
needs to collect the whole object in memory, then send it as a
content-length request with a plain body. Multipart upload puts less
stress on memory, but it has its limitation -- each part should be
at least 5MB in size. For that reason using the file API doesn't work --
the file IO API operates with external memory buffers, and the file impl
would only have raw pointers to them. In order to collect a 5MB chunk in
RAM, the impl would have to copy the memory, which is not good. Unlike the
file API, the data_sink API is more flexible, as it has temporary buffers
at hand and can cache them in a zero-copy manner.
Having said that, the S3 data_sink implementation is like this:
* put(buffer):
move the buffer into the local cache; once the local cache grows above 5MB,
send out the part
* flush:
send out whatever is in the cache, then send the upload completion request
* close:
check that the upload finished (in flush), abort the upload otherwise
User of the API may (actually should) wrap the sink with an output_stream
and use it as any other output_stream.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
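The put/flush buffering policy above can be sketched with a toy class; the name `multipart_sink` and its members are hypothetical, and the "upload" is reduced to recording part sizes, not real S3 calls.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of the buffering policy: accumulate buffers until the S3
// multipart minimum, then emit a part (here: just record its size).
class multipart_sink {
    static constexpr size_t min_part_size = 5 * 1024 * 1024; // 5MB minimum
    std::string _cache;
public:
    std::vector<size_t> part_sizes; // sizes of the parts "uploaded" so far

    // put(buffer): cache the buffer; once the cache reaches 5MB, send the part.
    void put(std::string buf) {
        _cache += buf; // the real impl moves buffers, zero-copy
        if (_cache.size() >= min_part_size) {
            part_sizes.push_back(_cache.size());
            _cache.clear();
        }
    }

    // flush: send out whatever is left; the real driver then sends the
    // upload completion request.
    void flush() {
        if (!_cache.empty()) {
            part_sizes.push_back(_cache.size());
            _cache.clear();
        }
    }
};
```

Writes below 5MB stay cached; only the final part emitted by flush() may be smaller than the minimum, which matches the S3 multipart rules.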
Those include -- HEAD to get the size, PUT to upload an object in one go,
GET to read the object as a contiguous buffer, and DELETE to drop one.
The client uses http client from seastar and just implements the S3
protocol using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The facilities in the run.py script allow launching scylla over a temporary
directory, waiting for it to come alive, killing it, etc. The limitation of
those is that the work-dir created for scylla is tightly coupled with its
pid. The object-storage test in the next patches will need to check that the
sstables are preserved on scylla restart, and this hard binding of
workdir to pid won't work.
This patch generalizes the scylla run/abort helpers to accept an
external directory to work on and adds a call to restart scylla process
over existing directory.
And one small related change here -- log file is opened in O_APPEND mode
so that restarted scylla process continues writing into the old file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When test.py starts it activates a minio server inside test-dir and
configures an anonymous bucket for test cases to run on
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an sstable is opened, it generates certain content into the TOC file. In
filesystem storage this first goes into a TemporaryTOC one. The future S3
driver will need the same content to put into a TOC object. To avoid
producing duplicate code, detach the content generation into a helper. The
next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Scylla stores a dht::token into each summary entry, for convenience.
But that costs us 16 bytes for each summary entry. That's because
dht::token has a kind field in addition to data, both 64 bits.
With 1M partitions, each averaging 4k bytes, a summary may end up
with ~90k summary entries. So dht::token alone adds ~1.5M to the
memory footprint of the summary.
We know summary samples index keys, therefore all tokens in all
summary entries cannot have any token kind other than 'key'.
Therefore, we can save 8 bytes for each summary entry by storing
a 64-bit raw token and converting it back into token whenever
needed.
Memory footprint of summary entries in a summary goes from
sizeof(summary_entry) * entries.size(): 1771520
to
sizeof(summary_entry) * entries.size(): 1417216
which is explained by the 8 bytes reduction per summary entry.
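A minimal illustration of where the 8 bytes per entry come from. The struct names and the kind values are hypothetical simplifications -- the real `dht::token` and `summary_entry` carry more than this -- but the layout arithmetic is the same:

```cpp
#include <cassert>
#include <cstdint>

// "Before": a full token carries a kind tag next to its 64 bits of data.
enum class token_kind : uint64_t { before_all_keys, key, after_all_keys };
struct token_before {
    token_kind kind;   // 8 bytes
    uint64_t data;     // 8 bytes
};

// "After": summary entries only ever index keys, so every stored token must
// have kind 'key'. Store just the raw 64 bits and reconstruct on the way out.
struct token_after {
    uint64_t raw;
    token_before to_token() const { return {token_kind::key, raw}; }
};

static_assert(sizeof(token_before) == 16, "tag + data");
static_assert(sizeof(token_after) == 8, "raw data only");
```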
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
summary has a memory pool, which is implemented as a set of contiguous
buffers of exponentially increasing size, with a max size of 128k.
This pool served for both storing keys of summary entries and their
respective tokens. The summary entry itself just stores a string_view,
which points to the actual data in the memory pool.
Since this series 31593e1451, which removed token_view, summary_entry
stores the actual token, not just the view.
Therefore, memory is being wasted, as SSTable loader / writer is
unnecessarily storing the token data into the pool.
With 11k summary entries, the footprint drops from 756004 to 624932.
An 18% reduction. Of course, the reduction depends on factors like key
size, where the key size can significantly outweigh this waste.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use token_metadata get_endpoint_to_host_id_map_for_reading
to get all normal token owners for all node operations,
rather than using gossip for some operation and
token_metadata for others.
Fixes#12862
Closes#13256
* github.com:scylladb/scylladb:
storage_service: node ops: standardize sync_nodes selection
storage_service: get_ignore_dead_nodes_for_replace: make static and rename to parse_node_list
This is not really an error, so print it in debug log_level
rather than error log_level.
Fixes#13374
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13462
Except for where usage of `std::regex` is required by 3rd party library interfaces.
As demonstrated countless times, std::regex's practice of using recursion for pattern matching can result in stack overflow, especially on AARCH64. The most recent incident happened after merging https://github.com/scylladb/scylladb/pull/13075, which (indirectly) uses `sstables::make_entry_descriptor()` to test whether a certain path is a valid scylla table path in a trial-and-error manner. This resulted in stacks blowing up on AARCH64.
To prevent this, use the already tried and tested method of switching from `std::regex` to `boost::regex`. Don't wait until each of the `std::regex` sites explode, replace them all preemptively.
Refs: https://github.com/scylladb/scylladb/issues/13404
Closes#13452
* github.com:scylladb/scylladb:
test: s/std::regex/boost::regex/
utils: s/std::regex/boost::regex/
db/commitlog: s/std::regex/boost::regex/
types: s/std::regex/boost::regex/
index: s/std::regex/boost::regex/
duration.cc: s/std::regex/boost::regex/
cql3: s/std::regex/boost::regex/
thrift: s/std::regex/boost::regex/
sstables: use s/std::regex/boost::regex/
in c642ca9e73, a reference to the
`config` parameter passed to the `thrift_server` constructor is
passed down to `create_handler_factory()`, which keeps it so it can
create connection handlers on demand. but unfortunately,
- the `config` parameter is a temporary variable
- the `config` parameter is moved away in the constructor after
`create_handler_factory()` is called
hence we have a dangling reference when the factory created by
`create_handler_factory()` tries to dereference it when
handling a new incoming connection.
in this change,
- the definitions of `_config` and `_handler_factory` member
variables are transposed, so that the former is initialized
first.
- `_handler_factory` now keeps a reference to `_config`'s member
variable, so that the weak reference it holds is always valid.
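The fix works because C++ initializes members in declaration order, regardless of the order of the constructor's initializer list. A minimal, self-contained sketch of the pattern (the class and member names here are illustrative, not the actual thrift_server code):

```cpp
#include <cassert>
#include <string>
#include <utility>

// The factory keeps only a weak reference to the configuration it was given.
struct handler_factory {
    const std::string& config;
    explicit handler_factory(const std::string& cfg) : config(cfg) {}
};

class server {
    // Members are initialized top to bottom, so _config exists before
    // _handler_factory's constructor runs and binds a reference to it.
    std::string _config;
    handler_factory _handler_factory;
public:
    explicit server(std::string config)
        : _config(std::move(config))   // the temporary is moved into the member first...
        , _handler_factory(_config) {} // ...so the factory refers to the long-lived member
    const std::string& handler_config() const { return _handler_factory.config; }
};
```

With the original ordering (factory declared before config), the factory would bind to the caller's temporary and dangle once the constructor returned.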
Fixes#13455
Branches: none
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13456
This patch reads the relabel config from a file if it exists. A problem
with the file or metrics would stop Scylla from starting. This is on
purpose, as it's a configuration problem that should be addressed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a configuration with an optional file name for
relabeling metrics. It also adds a function that accepts a file name
and loads the relabel config from a file.
An example for such a file:
```
$cat conf.yml
relabel_configs:
- source_labels: [shard]
action: drop
target_label: shard
regex: (2)
- source_labels: [shard]
action: replace
target_label: level
replacement: $1
regex: (.*3)
```
update_relabel_config_from_file throws an exception on failure, it's up
to the caller to decide what to do in such cases.
since the only user of operator<<(..., resource_kind) is now
`auth_resource_test`, let's just move it into this test. and
there is no need to keep this operator in the header file where
`resource_kind` is defined.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `auth::resource_kind`
without the help of fmt::ostream. its `operator<<(ostream,..)` is
reimplemented using fmtlib accordingly to ease the review.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we already have fmt::formatter<authentication_option>, and
there are no existing users of `operator<<(ostream&,
authentication_option)`, so let's just drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `auth::auth_authentication_options`
without the help of fmt::ostream. its `operator<<(ostream,..)` is
reimplemented using fmtlib accordingly to ease the review.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The former is prone to stack overflow as it uses recursion in
its match implementation.
The migration is mechanical for the most part.
escape() needs some special treatment; it looks like boost::regex wants
a double-escaped backslash.
The system_keyspace.hh now includes raft stuff, topology changes stuff, task_manager stuff, etc. It's going to include tablets.hh (but maybe not). Anything that deals with system keyspace, and includes system_keyspace.hh, would transitively pull these too. This header is becoming a central hub for all the features.
This PR removes all the headers from system_keyspace.hh that correspond to other "subsystems" keeping only generic mutations/querying and seastar ones.
Closes#13450
* github.com:scylladb/scylladb:
system_keyspace.hh: Remove unneeded headers
system_keyspace: Move topology_mutation_builder to storage_service
system_keyspace: Move group0_upgrade_state conversions to group0 code
After a failed topology operation, like bootstrap / decommission /
removenode, the cluster might contain a garbage entry in either token
ring or group 0. This entry can be cleaned-up by executing removenode on
any other node, pointing to the node that failed to bootstrap or leave
the cluster.
Document this procedure, including a method of finding the host ID of a
garbage entry.
Add references in other documents.
Fixes: #13122
Closes#13186
As a first step towards using host_id to identify nodes instead of ip addresses
this series introduces a node abstraction, kept in topology,
indexed by both host_id and endpoint.
The revised interface also allows callers to handle cases where nodes
are not found in the topology more gracefully by introducing `find_node()` functions
that look up nodes by host_id or inet_address and take a `must_exist` parameter
that, if false (the default), makes them return nullptr when the node is not found.
If true, `find_node` throws an internal error, since a miss then indicates a violation
of the internal assumption that the node must exist in the topology.
Callers that can handle missing nodes should use the more permissive flavor
and handle the !find_node() case gracefully.
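The lookup contract can be sketched like this. It is a simplification with assumed names -- the real topology indexes `node` objects by `host_id` and `inet_address` types, not strings, and throws Scylla's internal-error type rather than `std::runtime_error`:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

struct node { std::string host_id; std::string endpoint; };

class topology {
    std::unordered_map<std::string, node> _nodes_by_host_id;
public:
    void add_node(node n) { _nodes_by_host_id.emplace(n.host_id, std::move(n)); }
    // must_exist defaults to false: missing nodes yield nullptr so callers
    // can handle the miss gracefully. must_exist=true turns a miss into an
    // internal error, i.e. a violated invariant.
    const node* find_node(const std::string& host_id, bool must_exist = false) const {
        auto it = _nodes_by_host_id.find(host_id);
        if (it == _nodes_by_host_id.end()) {
            if (must_exist) {
                throw std::runtime_error("node must exist in topology: " + host_id);
            }
            return nullptr;
        }
        return &it->second;
    }
};
```

The returned pointer is valid only while the topology instance is alive, which is why callers must hold on to the token_metadata (e.g. via a token_metadata_ptr).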
Closes#11987
* github.com:scylladb/scylladb:
topology: add node state
topology: remove dead code
locator: add class node
topology: rename update_endpoint to add_or_update_endpoint
topology: define get_{rack,datacenter} inline
shared_token_metadata: mutate_token_metadata: replicate to all shards
locator: endpoint_dc_rack: refactor default_location
locator: endpoint_dc_rack: define default operator==
test: storage_proxy_test: provide valid endpoint_dc_rack
The latter is the only user of the class. This keeps system keyspace
code free from unrelated logic and from raft::server_id type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to keep system keyspace free from group0 logic and from the
service::group0_upgrade_state type
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we used `git submodule summary ${submodule}`
for collecting the titles of commits between the current HEAD and
origin/master. normally, this works just fine. but it fails to
collect all commits if origin/master happens to reference
a merge commit. for instance, if we have a history like:
1. merge foo
2. bar
3. foo
4. baz <--- submodule is pointing here.
`git submodule summary` would just print out the titles of commits
of 1 and 3.
so, in this change, instead of relying on `git submodule summary`,
we just collect the commits using `git log`. but we preserve the
output format used by `git submodule summary` to be consistent with
the previous commits bumping up the submodules. please note, in
this change, instead of matching the output of `git submodule summary`,
we use `git merge-base --is-ancestor HEAD origin/master` to check
whether we are going to create a fast-forward change, which is less fragile.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13366
A problem in compaction reevaluation can cause the SSTable set to be left uncompacted for an indefinite amount of time, potentially causing suboptimal space and read amplification.
Two reevaluation problems are being fixed: one after off-strategy compaction ends, and another in the compaction manager, which is meant to periodically reevaluate the need for compaction.
Fixes https://github.com/scylladb/scylladb/issues/13429.
Fixes https://github.com/scylladb/scylladb/issues/13430.
Closes#13431
* github.com:scylladb/scylladb:
compaction: Make compaction reevaluation actually periodic
replica: Reevaluate regular compaction on off-strategy completion
Writing into an sstable component output stream should be done with care. In particular, flushing can happen only once, right before closing the stream. Flushing the stream in between several writes is not going to work, because the file stream would step on unaligned IO, and the S3 upload stream would send a completion message to the server and lose any subsequent writes.
Most of the file_writer users already obey that and flush the writer once right before closing it. do_write_simple() is extra careful about exception handling, but it's overkill (see the first patch).
It's better to make file_writer API explicitly lack the ability to flush itself by flushing the stream when closing the writer.
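The revised contract can be sketched as follows. The names are illustrative stand-ins for the sstables file_writer, and a string stands in for the underlying stream; the point is only that flush is private and happens exactly once, inside close:

```cpp
#include <cassert>
#include <string>

class file_writer {
    std::string _buffered;
    std::string& _sink;   // stands in for the underlying file/S3 stream
    bool _closed = false;
    // Private: callers have no way to flush mid-stream, which would break
    // unaligned file IO or prematurely complete an S3 multipart upload.
    void flush() { _sink += _buffered; _buffered.clear(); }
public:
    explicit file_writer(std::string& sink) : _sink(sink) {}
    void write(const std::string& data) { _buffered += data; }
    void close() {
        if (!_closed) {
            flush();       // the one and only flush, right before closing
            _closed = true;
        }
    }
};
```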
Closes#13338
* github.com:scylladb/scylladb:
sstables: Move writer flush into close (and remove it)
sstables: Relax exception handling in do_write_simple
This test currently uses `test/lib/test_table.hh` to generate data for its test cases. This data generation facility is used by no other tests. Worse, it is redundant as we already have a random data generator with fixed schema, in `test/lib/mutation_source_test.hh`. So in this series, we migrate the test cases in said test file to random schema and its random data generation facilities. These are used by several other test cases and using random schema allows us to cover a wider (quasi-infinite) number of possibilities.
After migrating all tests away from it, `test/lib/test_table.hh` is removed.
This series also reduces the runtime of `fuzzy_test` drastically. It should now run in a few minutes or even in seconds (depending on the machine).
Fixes: #12944
Closes#12574
* github.com:scylladb/scylladb:
test/lib: rm test_table.hh
test/boost/multishard_mutation_query_test: migrate other tests to random schema
test/boost/multishard_mutation_query_test: use ks keyspace
test/boost/multishard_mutation_query_test: improve test pager
test/boost/multishard_mutation_query_test: refactor fuzzy_test
test/boost: add multishard_mutation_query_test more memory
types/user: add get_name() accessor
test/lib/random_schema: add create_with_cql()
test/lib/random_schema: fix udt handling
test/lib/random_schema: type_generator(): also generate frozen types
test/lib/random_schema: type_generator(): make static column generation conditional
test/lib/random_schema: type_generator(): don't generate duration_type for keys
test/lib/random_schema: generate_random_mutations(): add overload with seed
test/lib/random_schema: generate_random_mutations(): respect range tombstone count param
test/lib/random_schema: generate_random_mutations(): add yields
test/lib/random_schema: generate_random_mutations(): fix indentation
test/lib/random_schema: generate_random_mutations(): coroutinize method
test/lib/random_schema: generate_random_mutations(): expand comment
Every tracker insertion has to have a corresponding removal or eviction
(otherwise the number of rows in the tracker will be misaccounted).
If we add the row to the tracker before adding it to the tree,
and the tree insertion fails (with bad_alloc), this contract will be violated.
Fix that.
Note: the problem is currently irrelevant because an exception during
sentinel insertion will abort the program anyway.
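The exception-safety rule at play here can be sketched generically. This is not the row_cache code -- `std::set` stands in for the tree and a counter for the tracker -- it only shows the ordering: do the throwing step first, and the bookkeeping that must stay balanced only after nothing else can fail:

```cpp
#include <cassert>
#include <new>
#include <set>

struct tracker { long rows = 0; };

// fail_insertion simulates a bad_alloc thrown from inside the tree insertion.
void insert_row(std::set<int>& tree, tracker& t, int row, bool fail_insertion) {
    if (fail_insertion) {
        throw std::bad_alloc();
    }
    tree.insert(row);   // the step that can throw comes first...
    ++t.rows;           // ...the tracker is updated only once insertion succeeded
}
```

With the orders reversed, a failed tree insertion would leave the tracker counting a row that was never inserted, violating the insert/remove pairing contract.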
Closes#13336
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.
Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
sstables::offstrategy offstrategy)
So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.
Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13432
this is the 15th changeset of a series which tries to give an overhaul to the CMake building system. this series has two goals:
- to enable developer to use CMake for building scylla. so they can use tools (CLion for instance) with CMake integration for better developer experience
- to enable us to tweak the dependencies in a simpler way. a well-defined cross module / subsystem dependency is a prerequisite for building this project with the C++20 modules.
also, i just found that the scylla executable built with the cmake building system segfaults at master HEAD. like
```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==3974496==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7ffd48549f70 sp 0x7ffd48549728 T0)
==3974496==Hint: pc points to the zero page.
==3974496==The signal is caused by a READ memory access.
==3974496==Hint: address points to the zero page.
#0 0x0 (<unknown module>)
#1 0x14e785a5 in wasmtime_runtime::traphandlers::unix::trap_handler::h1f510afc2968497f /home/kefu/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/wasmtime-runtime-5.0.1/src/traphandlers/unix.rs:159:9
#2 0x7f3462e5eb9f (/lib64/libc.so.6+0x3db9f) (BuildId: 6107835fa7d4725691b2b7f6aaee7abe09f493b2)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
==3974496==ABORTING
Aborting on shard 0.
Backtrace:
0xd16c38a
0x13c5aab0
0x13b9821e
0x13c2fdc7
/lib64/libc.so.6+0x3db9f
/lib64/libc.so.6+0x8eb93
/lib64/libc.so.6+0x3daed
/lib64/libc.so.6+0x2687e
0xd1e5f8a
0xd1e3d34
0xd1ca059
0xd1c5e29
0xd1c5605
0x14e785a5
/lib64/libc.so.6+0x3db9f
```
decoded:
```
__interceptor_backtrace at ??:?
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /home/kefu/dev/scylladb/seastar/include/seastar/util/backtrace.hh:60
seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/scylladb/seastar/src/core/reactor.cc:778
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at /home/kefu/dev/scylladb/seastar/src/core/reactor.cc:808
seastar::print_with_backtrace(char const*, bool) at /home/kefu/dev/scylladb/seastar/src/core/reactor.cc:820
(inlined by) seastar::sigabrt_action() at /home/kefu/dev/scylladb/seastar/src/core/reactor.cc:3882
(inlined by) operator() at /home/kefu/dev/scylladb/seastar/src/core/reactor.cc:3858
(inlined by) __invoke at /home/kefu/dev/scylladb/seastar/src/core/reactor.cc:3854
/lib64/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6107835fa7d4725691b2b7f6aaee7abe09f493b2, for GNU/Linux 3.2.0, not stripped
__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
__sanitizer::Abort() at ??:?
__sanitizer::Die() at ??:?
__asan::ScopedInErrorReport::~ScopedInErrorReport() at ??:?
__asan::ReportDeadlySignal(__sanitizer::SignalContext const&) at ??:?
__asan::AsanOnDeadlySignal(int, void*, void*) at ??:?
wasmtime_runtime::traphandlers::unix::trap_handler at /home/kefu/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/wasmtime-runtime-5.0.1/src/traphandlers/unix.rs:159
__GI___sigaction at :?
```
this led me to this change. but unfortunately, this changeset does not address the segfault. will continue the investigation in my free cycles.
Closes#13434
* github.com:scylladb/scylladb:
build: cmake: include cxx.h with relative path
build: cmake: set stack frame limits
build: cmake: pass -fvisibility=hidden to compiler
build: cmake: use -O0 on aarch64, otherwise -Og
The S3 client cannot perform anonymous multipart uploads into any real S3
buckets regardless of their configuration. Since multipart upload is an
essential part of the sstables backend, we need to implement
authorization support in the client early.
(side note: with minio, anonymous multipart upload works, and with aws s3,
anonymous PUT and DELETE can be configured; it's exactly the combination
of aws + multipart upload that needs authorization)
Fortunately, the signature generation and signature checking code is
symmetrical, and we already have the checking option in alternator :) So
what this patch does is just move the alternator::get_signature()
helper into utils/. A sad side effect is that all tests now need to
link with gnutls :( which is used to compute the hash value itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13428
This PR reverts the scylla sstable schema loading improvements as they fail in CI every other run. I am already working on fixes for these but I am not sure I understand all the failures so it is best to revert and re-post the series later.
Fixes: #13404
Fixes: #13410
Closes#13419
* github.com:scylladb/scylladb:
Revert "Merge 'tool/scylla-sstable: more flexibility in obtaining the schema' from Botond Dénes"
Revert "tools/schema_loader: don't require results from optional schema tables"
The manager is intended to periodically reevaluate the compaction need for
each registered table, but it's not working as intended:
the reevaluation is one-off.
This means that compaction would not kick in later for a table with
low-to-no write activity that had data expiring 1 hour from now.
Also make sure that reevaluation happens within the compaction
scheduling group.
Fixes#13430.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When off-strategy compaction completes, regular compaction is not triggered.
If off-strategy output causes the table's SSTable set to not conform to the strategy's
goal, it means that read and space amplification will be suboptimal until the next
compaction kicks in, which can take an indefinite amount of time (e.g. until the active
memtable is flushed).
Let's reevaluate compaction on main SSTable set when off-strategy ends.
Fixes#13429.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
before this change, the wasm binding source files includes the
cxxbridge header file of `cxx.h` with its full path.
to better mirror the behavior of configure.py, let's just
include this header file with relative path.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* transpose include(mode.common) and include (mode.${build_mode}),
so the former can reference the value defined by the latter.
* set stack_usage_threshold for supported build modes.
please note, this compiler option (-Wstack-usage=<bytes>) is only
supported by GCC so far.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this addresses an oversight in b234c839e4,
which is supposed to mirror the behavior of `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Related: https://github.com/scylladb/scylla-enterprise/issues/2770
This commit adds the upgrade guide from ScyllaDB Open Source 5.2
to ScyllaDB Enterprise 2023.1.
This commit does not cover metric updates (the metrics file has no
content, which needs to be added in another PR).
As this is an upgrade guide, this commit must be merged to master and
backported to branch-5.2 and branch-2023.1 in scylla-enterprise.git.
Closes#13294
Task manager compaction tasks that cover compaction group
compaction need access to compaction_manager::tasks.
To avoid circular dependency and be able to rely on forward
declaration, task needs to be moved out of compaction manager.
To avoid naming confusion compaction_manager::task is renamed.
Closes#13226
* github.com:scylladb/scylladb:
compaction: use compaction namespace in compaction_manager.cc
compaction: rename compaction::task
compaction: move compaction_manager::task out of compaction manager
compaction: move sstable_task definition to source file
This reverts commit 32fff17e19, reversing
changes made to 164afe14ad.
This series proved to be problematic, the new test introduced by it
failing quite often. Revert it until the problems are tracked down and
fixed.
There are two occasions in scylla_cluster
where we read the node logs, and in both of
them we read the entire file into memory.
This is not efficient and may cause an OOM.
In the first case we need the last line of the
log file, so we seek to the end and move backwards
looking for a newline symbol.
In the second case we look through the
log file to find the expected_error.
The readlines() method returns a Python
list object, which means it reads the entire
file into memory. It's sufficient to just remove
the call, since iterating over the file object
already yields lines lazily, one by one.
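The backwards tail-read idea can be sketched as follows. Note the actual fix lives in the Python test harness; this is just the same algorithm expressed in C++ for illustration -- seek to the end and walk backwards until a newline, so only the last line is ever held in memory regardless of the log size:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns the last non-empty line of the file without reading the whole file
// forward into memory.
std::string last_line(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    f.seekg(0, std::ios::end);
    std::streamoff pos = f.tellg();
    std::string line;
    while (pos > 0) {
        f.seekg(--pos);
        char c = static_cast<char>(f.get());
        if (c == '\n' && !line.empty()) {
            break;                          // reached the start of the last line
        }
        if (c != '\n') {
            line.insert(line.begin(), c);   // skip the trailing newline, keep the rest
        }
    }
    return line;
}
```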
This is a follow-up for #13134.
Closes#13399
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `position_in_partition` and `partition_region` without using ostream<<. also, this change removes `operator<<(ostream, const position_in_partition_view&)` , `operator<<(ostream, const partition_region&)` along with their callers.
Refs #13245
Closes#13391
* github.com:scylladb/scylladb:
mutation: drop operator<< for position_in_partition and friends
partition_snapshot_row_cursor: do not use operator<< when printing position
mutation: specialize fmt::formatter<position_in_partition>
mutation: specialize fmt::formatter<partition_region>
as alien::run_on() requires the function to be noexcept, let's
make this explicit. also, this paves the way for the type constraint
added to `alien::run_on()`. the type constraint will enforce this
requirement on the function passed to `alien::run_on()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13375
That's courtesy of 153813d3b8, which annotates Seastar smart pointer classes with Clang's consumed attributes, to help Clang statically spot use-after-move bugs.
Closes#13386
* github.com:scylladb/scylladb:
replica: Fix use-after-move in table::make_streaming_reader
index/built_indexes_virtual_reader.hh: Fix use-after-move
db/view/build_progress_virtual_reader: Fix use-after-move
sstables: Fix use-after-move when making reader in reverse mode
Add a simple node state model with:
`joining`, `normal`, `leaving`, and `left` states
to help manage nodes during replace
with the same IP address.
Later on, this could also help prevent nodes
that were decommissioned, removed, or replaced
from rejoining the cluster.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And keep per node information (idx, host_id, endpoint, dc_rack, is_pending)
in node objects, indexed by topology on several indices like:
idx, host_id, endpoint, current/pending, per dc, per dc/rack.
The node index is a shorthand identifier for the node.
node* and index are valid while the respective topology instance is valid.
To be used, the caller must hold on to the topology / token_metadata object
(e.g. via a token_metadata_ptr or effective_replication_map)
Refs #6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
topology: add node idx
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define get_location() that gets the location
for the local node, and use either this entry point
or get_location(inet_address) to get the respective
dc or rack.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_service::replicate_to_all_cores has a sophisticated way
of mutating the token_metadata and effective_replication_map
on shard 0 and cloning them to all other shards, applying
the changes only if mutate and clone succeeded on all shards,
so we don't end up with only some of the shards holding the mutated
copy if an error happened mid-way (which would then require
rolling back the change for exception safety).
shared_token_metadata::mutate_token_metadata is currently only called from
a unit test that needs to mutate the token metadata only on shard 0,
but a following patch will require doing that on all shards.
This change adds this capability by enforcing the call to be
on shard 0, mutating the token_metadata into a temporary pending copy
and cloning it on all other shards. Only then, when all shards
have succeeded, is the modified token_metadata set on all shards.
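The all-or-nothing scheme can be sketched without Seastar. This is a single-threaded stand-in (a vector of per-shard copies instead of real shards, and `token_metadata` reduced to a vector) that shows only the two-phase shape: do every fallible step on pending copies, and publish only after all of them succeeded:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using token_metadata = std::vector<int>; // stand-in for the real type

void mutate_on_all_shards(std::vector<token_metadata>& per_shard,
                          const std::function<void(token_metadata&)>& mutate) {
    // Phase 1 (fallible): mutate a pending copy of shard 0's metadata,
    // then clone it once per shard. Any of these steps may throw, and if
    // one does, no shard has been touched yet.
    token_metadata pending = per_shard.at(0);
    mutate(pending);
    std::vector<token_metadata> clones;
    for (size_t i = 0; i < per_shard.size(); ++i) {
        clones.push_back(pending);
    }
    // Phase 2 (non-throwing): publish the clones. Moves of ready objects
    // cannot fail, so the shards can never end up divergent.
    for (size_t i = 0; i < per_shard.size(); ++i) {
        per_shard[i] = std::move(clones[i]);
    }
}
```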
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the thread_local default_location out of
topology::get_location so it can be used elsewhere.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use token_metadata get_endpoint_to_host_id_map_for_reading
to get all normal token owners for all node operations,
rather than using gossip for some operation and
token_metadata for others.
Fixes#12862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Courtesy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});
The use-after-move is UB; whether it happens depends on evaluation order.
We haven't hit it yet, as clang evaluates left-to-right.
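The fix pattern boils down to sequencing the read before the move. A simplified stand-in (not the real row_cache types) showing the shape of the bug and the repair:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

struct cache_entry {
    long _token;
    std::string _payload;
    long token() const { return _token; }
};

std::map<long, cache_entry> partitions;

void insert(cache_entry entry) {
    // BUG pattern:
    //   partitions.emplace(entry.token(), std::move(entry));
    // reads `entry` and moves from it in the same full-expression; the two
    // arguments are unsequenced relative to each other, so this is UB.
    auto key = entry.token();                  // sequence the read first...
    partitions.emplace(key, std::move(entry)); // ...then the move is the only use
}
```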
Fixes#13400.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13401
When loading a schema from disk, only the `tables` and `columns` tables
are required to have an entry for the loaded schema. All the others are
optional. Yet the schema loader expects all the tables to have a
corresponding entry, which leads to errors when trying to load a schema
which doesn't. Relax the loader to only require existing entries in the
two mandatory tables and not the others.
Closes#13393
Variant used by
streaming/stream_transfer_task.cc: , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))
as the full slice is retrieved after the schema is moved (clang evaluates
left-to-right), the stream transfer task can potentially be working
on a stale slice for a particular set of partitions.
static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());
Fixes#13397.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),
Fixes#13396.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
use-after-move in the ctor, which potentially leads to a failure
when locating the table from the moved schema object.
static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
Fixes#13395.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);
Fixes#13394.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
in order to prepare for dropping the `operator<<()` for `position_in_partition_view`,
let's use fmtlib to print `position()`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print
- position_in_partition
- position_in_partition_view
- position_in_partition_view::printer
without the help of fmt::ostream. their `operator<<(ostream,..)` are
reimplemented using fmtlib accordingly to ease the review.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `partition_region` with the help of fmt::ostream.
to help with the review process, the corresponding `to_string()` is
dropped, and its callers now switch over to `fmt::to_string()` in
this change as well. to use `fmt::to_string()` helps with consolidating
all places to use fmtlib for printing/formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
sleep_abortable() is aborted on success, which causes a sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if it happens often.
Also, it spams the logs if --logger-log-level exception=trace is enabled.
Avoid this by swallowing the exception on cancellation.
Fixes #13278.
Closes #13279
When adding extra columns in a test, make them value columns. Name them
with the "v_" prefix and use the value column number counter.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13271
To allow tests with custom clusters, allow configuration of initial
cluster size of 0.
Add a proof-of-concept test to be removed later.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13342
Let the caller pass the string to parse to the function,
rather than having the function get to it via _db.local().get_config(),
so it can be used as a general-purpose function.
Make it static now that it doesn't require an instance.
Rename to `parse_node_list` as that's what the function does.
It doesn't care if the nodes are to be ignored or something else
(e.g. removed), they only need to be in token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
generation_for_sharded_test is not used by any of these sstable
tests, so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13388
Task manager task implementations of classes that cover
offstrategy keyspace compaction, which can be started through
the /storage_service/keyspace_compaction/ api.
Top level task covers the whole compaction and creates child
tasks on each shard.
Closes #12713
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test offstrategy compaction
compaction: create task manager's task for offstrategy keyspace compaction on one shard
compaction: create task manager's task for offstrategy keyspace compaction
compaction: create offstrategy_compaction_task_impl
The forward_service.hh and raft_group0_client.hh includes can be replaced
with forward declarations. A few other files need their previously
indirectly included headers back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13384
The commitlog api originally implied that the commitlog_directory would contain files from a single commitlog instance. This is checked in segment_manager::list_descriptors: if it encounters a file with an unknown prefix, an exception occurs in `commitlog::descriptor::descriptor`, which is logged at the `WARN` level.
A new schema commitlog was added recently, which shares the filesystem directory with the main commitlog. This causes warnings to be emitted on each boot. This patch solves the warnings problem by moving the schema commitlog to a separate directory. In addition, the user can employ the new `schema_commitlog_directory` parameter to move the schema commitlog to another disk drive.
This is expected to be released in 5.3.
As #13134 (raft tables->schema commitlog) is also scheduled for 5.3, and it already requires a clean rolling restart (no cl segments to replay), we don't need to specifically handle upgrade here.
Fixes: #11867
Closes #13263
* github.com:scylladb/scylladb:
commitlog: use separate directory for schema commitlog
schema commitlog: fix commitlog_total_space_in_mb initialization
The commitlog api originally implied that the commitlog_directory would contain files from a single commitlog instance. This is checked in segment_manager::list_descriptors: if it encounters a file with an unknown prefix, an exception occurs in commitlog::descriptor::descriptor, which is logged at the WARN level.
A new schema commitlog was added recently, which shares the filesystem directory with the main commitlog. This causes warnings to be emitted on each boot. This patch solves the warnings problem by moving the schema commitlog to a separate directory. In addition, the user can employ the new schema_commitlog_directory parameter to move the schema commitlog to another disk drive.
By default, the schema commitlog directory is nested in the commitlog_directory. This can help avoid problems during an upgrade if the commitlog_directory in the custom scylla.yaml is located on a separate disk partition.
This is expected to be released in 5.3. As #13134 (raft tables->schema commitlog) is also scheduled for 5.3, and it already requires a clean rolling restart (no cl segments to replay), we don't need to specifically handle upgrade here.
Fixes: #11867
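A hedged sketch of how the new option could be set in scylla.yaml (the paths are examples only):

```yaml
# scylla.yaml -- illustrative paths only
commitlog_directory: /var/lib/scylla/commitlog
# Optional: place the schema commitlog on another drive. If unset, it is
# nested under commitlog_directory by default.
schema_commitlog_directory: /mnt/disk2/scylla/schema_commitlog
```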
schema commitlog: fix commitlog_total_space_in_mb initialization
It seems there was a typo here, which caused commitlog_total_space_in_mb
to always be zero and the schema commitlog to be effectively unlimited
in size.
Preparing for #10459, this series defines sstables::generation_type::int_t
as `int64_t` for the moment and uses it instead of naked `int64_t` variables,
so it can be changed in the future to hold e.g. a `std::variant<int64_t, sstables::generation_id>`.
sstables::new_generation was defined to generate new, unique generations.
Currently it is based on incrementing a counter, but it can be extended in the future
to manufacture UUIDs.
The unit tests are cleaned up in this series to minimize their dependency on numeric generations.
Basically, numeric generations should only be used for loading sstables with hard-coded generation numbers stored under `test/resource/sstables`.
For all the rest, the tests should use existing mechanisms and ones introduced in this series, such as generation_factory, sst_factory and the smart make_sstable methods in sstable_test_env and table_for_tests, to generate new sstables with a unique generation, and use the abstract sst->generation() method to get their generation if needed, without resorting to the actual value it may hold.
Closes #12994
* github.com:scylladb/scylladb:
everywhere: use sstables::generation_type
test: sstable_test_env: use make_new_generation
sstable_directory::components_lister::process: fixup indentation
sstables: make highest_generation_seen return optional generation
replica: table: add make_new_generation function
replica: table: move sstable generation related functions out of line
test: sstables: use generation_type::int_t
sstables: generation_type: define int_t
The wasm engine is moved from replica::database to the query_processor.
The wasm instance cache and compilation thread runner were already there,
but now they're also initialized in the query_processor constructor.
By moving the initialization to the constructor, we can now
be certain that all wasm-related objects (wasm instance cache,
compilation thread runner, and wasm engine, which was already
passed in the constructor) are initialized when we try to use
them because we have to use the query processor to access them
anyway.
The change is also motivated by the fact that we're planning
to take Wasm UDFs out of experimental, after which they should
stop getting special treatment.
Closes #13311
* github.com:scylladb/scylladb:
wasm: move wasm initialization to query_processor constructor
wasm: return wasm instance cache as a reference instead of a pointer
wasm: move wasm engine to query_processor
Currently, aggregate functions are implemented in a stateful manner.
The accumulator is stored internally in an aggregate_function::aggregate,
requiring each query to instantiate new instances (see
aggregate_function_selector's constructor, and note how it's called
from selector::new_instance()).
This makes aggregates hard to use in expressions, since expressions
are stateless (with state only provided to evaluate()). To facilitate
migration towards stateless expressions, we define a
stateless_aggregate_function (modeled after user-defined aggregates,
which are already stateless). This new struct defines the aggregate
in terms of three scalar functions: one to aggregate a new input into
an accumulator (provided in the first parameter), one to finalize an
accumulator into a result, and one to reduce two accumulators for
parallelized aggregation.
All existing native aggregate functions are converted to the new model, and
the old interface is removed. This series does not yet convert selectors to
expressions, but it does remove one of the obstacles.
Performance evaluation: I created a table with a million ints on a single-node cluster, and ran the avg() function on them. I measured the number of instructions executed with `perf stat -p $(pgrep scylla) -e instructions` while the query was running. The query executed from cache; memtables were flushed beforehand. The instruction count per row increased from roughly 49k to roughly 52k, indicating 3k extra instructions per row. While 3k instructions to execute a function is huge, it is currently dwarfed by other overhead (and will be even less important in a cluster where CL>1 will cause non-coordinator code to run multiple times).
Closes #13105
* github.com:scylladb/scylladb:
cql3/selection, forward_service: use stateless_aggregate_function directly
db: functions: fold stateless_aggregate_function_adapter into aggregate_function
cql3: functions: simplify accumulator_for template
cql3: functions: base user-defined aggregates on stateless aggregates
cql3: functions: drop native_aggregate_function
cql3: functions: reimplement count(column) statelessly
cql3: functions: reimplement avg() statelessly
cql3: functions: reimplement sum() statelessly
cql3: functions: change wide accumulator type to varint
cql3: functions: unreverse types for min/max
cql3: functions: rename make_{min,max}_dynamic_function
cql3: functions: reimplement min/max statelessly
cql3: functions: reimplement count(*) statelessly
cql3: functions: simplify creating native functions even more
cql3: functions: add helpers for automating marshalling for scalar functions
types: fix big_decimal constructor from literal 0
cql3: functions: add helper class for internal scalar functions
db: functions: add stateless aggregate functions
db, cql3: move scalar_function from cql3/functions to db/functions
`scylla-sstable` currently has two ways to obtain the schema:
* via a `schema.cql` file.
* load schema definition from memory (only works for system tables).
This meant that for most cases it was necessary to export the schema into a `CQL` format and write it to a file. This approach is very flexible: the sstable can be inspected anywhere, it doesn't have to be on the same host where it originates from. Yet in many cases the sstable *is* inspected on the same host where it originates from. In these cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool checks the usual locations of the scylla data dir, the scylla configuration file, and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a `schema.cql` is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.
This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag; the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow, as this tool is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully this will change after this series.
Example:
```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```
As seen above, subdirectories like `quarantine`, `staging` etc. are also supported.
Fixes: https://github.com/scylladb/scylladb/issues/10126
Closes #13075
* github.com:scylladb/scylladb:
docs/operating-scylla/admin-tools: scylla-sstable.rst: update schema section
test/cql-pytest: test_tools.py: add test for schema loading
test/cql-pytest: nodetool.py: add flush_keyspace()
tools/scylla-sstable: reform schema loading mechanism
tools/schema_loader: add load_schema_from_schema_tables()
db/schema_tables: expose types schema
Writing into sstable component output stream should be done with care.
In particular -- flushing can happen only once, right before closing the
stream. Flushing the stream in between several writes is not going to
work, because the file stream would step on unaligned IO, and the S3 upload
stream would send a completion message to the server and lose any
subsequent write.
Having said that, it's better to remove the flush() ability from the
component writer so as not to tempt developers.
refs: #13320
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This effectively reverts 000514e7cc (sstable: close file_writer if an
exception is thrown) because it became obsolete with 60873d2360 (sstable:
file_writer: auto-close in destructor).
The change is in fact idempotent.
Before the patch, the writer was closed regardless of whether write/flush
failed or not. After the patch, the writer will close itself in the
destructor for sure.
Before the patch, an exception from write/flush was caught, then close
was called, and regardless of whether close failed or not the former
exception was re-thrown. After the patch, an exception from write/flush
will result in writer destruction, which ignores any close exception.
Before the patch, close throwing after a successful write/flush re-threw
the close exception. After the patch, the writer is closed "by hand" and
any exception will be reported.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `composite` and `composite_view` without using ostream<<. also, this change removes `operator<<(ostream, const composite&)` , `operator<<(ostream, const composite_view&)` along with their callers.
Refs #13245
Closes #13360
* github.com:scylladb/scylladb:
compound_compat: remove operator<<(ostream, composite)
compound_compat: remove operator<<(ostream, composite_view)
sstables: do not use operator<< to print composite_view
compound_compat.hh: specialize fmt::formatter<composite>
compound_compat.hh: specialize fmt::formatter<composite_view>
compound_compat.hh: specialize fmt::formatter<component_view>
... and drop usage of global storage proxy from several places of mutate_MV().
This is the last dependency loop around the storage proxy, as well as the last user of the global storage proxy. The trouble is that while the proxy naturally depends on the database, the database SUDDENLY requires the proxy to push view updates from the guts of database::do_apply().
Similar loop existed in a form of database -> { large_data_handler, compaction manager } -> system keyspace -> database and it was cut in 917fdb9e53 (Cut database-system_keyspace circular dependency) by introducing a soft dependency link from l. d. handler / compaction manager to system keyspace. The similar solution is proposed here.
The database instance gets a soft dependency (shared_ptr) on the view_update_generator instance. On start the link is nullptr and pushing view updates is not possible until the view_update_generator starts and plugs itself into the database. The plugging happens naturally, because the v.u.generator needs the proxy as an explicit dependency and thus can reach the database via the proxy. This (seems to) work because tables that need view updates don't start being mutated until late enough, as late as when the v.u.generator starts.
As a nice side effect this allows removing a bunch of global storage proxy usages from mutate_MV() which opens a pretty short way towards de-globalizing proxy (after it only qctx, tracing and schema registry will be left).
Closes #13367
* github.com:scylladb/scylladb:
view: Drop global storage_proxy usage from mutate_MV()
view: Make mutate_MV() method of view_update_generator
table: Carry v.u.generator down to populate_views()
table: Carry v.u.generator down to do_push_view_replica_updates()
view: Keep v.u.generator shared pointer on view_builder::consumer
view: Capture v.u.generator on view_updating_consumer lambda
view: Plug view update generator to database
view: Add view_builder -> view_update_generator dependency
view: Add view_update_generator -> sharded<storage_proxy> dependency
in this change, the following query timeout config options are marked live update-able:
- range_request_timeout_in_ms
- read_request_timeout_in_ms
- counter_write_request_timeout_in_ms
- cas_contention_timeout_in_ms
- truncate_request_timeout_in_ms
- write_request_timeout_in_ms
- request_timeout_in_ms
as per https://github.com/scylladb/scylladb/issues/10172,
> Many users would like to set the driver timers based on server timers.
> For example: expire a read timeout before or after the server read time
> out.
with this change, we are able to set the timeouts on the fly. these timeout options specify how long the coordinator waits for the completion of different kinds of operations. but these options are cached by the servers consuming them, so in this series, helpers are added to update the cached values when the options get modified. also, since the observers are not copyable, sharded_parameter is used to initialize the config when creating these sharded servers.
Fixes #12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12531
* github.com:scylladb/scylladb:
timeout_config: remove unused make_timeout_config()
client_state: split the param list of ctor into multi lines
redis,thrift,transport: make timeout_config live-updateable
config: mark query timeouts live update-able
transport: mark cql_server::timeout_config() const
auth: remove unused forward declaration
redis: drop unused member function
transport: drop unused member function
thrift: keep a reference of timeout_config in handler_factory
redis,thrift,transport: initialize _config with std::move(config)
redis,thrift,transport: pass config via sharded_parameter
utils: config_file: add a space after `=`
since we don't build the rpm/deb packages from a source tarball anymore,
and instead build the rpm/deb packages from the precompiled relocatable
package, there is no need to keep git-archive-all in the repo. in this
change, the git-archive-all script and its license file are removed.
they were added for building rpm packages from a source tarball in
f87add31a7.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13372
This is important for multiple compaction groups, as they cannot share state that is scoped to a single SSTable set.
The solution is about:
1) Decoupling compaction strategy from its state; making compaction_strategy a pure stateless entity
2) Each compaction group storing its own compaction strategy state
3) Compaction group feeds its state into compaction strategy whenever needed
Closes #13351
* github.com:scylladb/scylladb:
compaction: TWCS: wire up compaction_strategy_state
compaction: LCS: wire up compaction_strategy_state
compaction: Expose compaction_strategy_state through table_state
replica: Add compaction_strategy_state to compaction group
compaction: Introduce compaction_strategy_state
compaction: add table_state param to compaction_strategy::notify_completion()
compaction: LCS: extract state into a separate struct
compaction: TWCS: prepare for stateless strategy
compaction: TWCS: extract state into a separate struct
compaction: add const-qualifier to a few compaction_strategy methods
Now mutate_MV is a method of v.u.generator, which has a reference to
the sharded<storage_proxy>. A few static helper wrappers are patched to
get the needed proxy or database reference from the mutate_MV call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays it's a static helper, but internally it depends on the storage
proxy, so it grabs its global instance. Making it a method of the view
update generator makes it possible to use the proxy dependency from the
generator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is called by view_builder::consumer when building a view and
the consumer already has stable dependency reference on the view updates
generator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retrieving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partition is interrupted
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engaged optional from `detach_state()`, because there is no state to
detach; the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch by overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning, if they want to skip the partition, then `_stop` is reset to
`stop_iteration::no` and `detach_state()` will return a disengaged
optional, as it should in this case.
Fixes: #12629
Closes #13365
We introduced exclude_submodules in 19da4a5b8f
to exclude tools/java and tools/jmx, since they have their own
relocatable packages and we don't want to package the same files twice.
However, most of the files under tools/ are not needed for installation;
we just need tools/scyllatop.
So what we really need to do is "ar.reloc_add('tools/scyllatop')", not
excluding files from tools/.
Related: #13183
Closes #13215
To avoid confusion with task manager tasks, compaction::task is renamed
to compaction::compaction_task_executor. All inheriting classes are
modified similarly.
compaction_manager::task needs to be accessed from task manager compaction
tasks. Thus, compaction_manager::task and all inheriting classes are moved
from the compaction manager to the compaction namespace.
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `authenticated_user` without using ostream<<. also, this change removes all existing callers of `operator<<(ostream, const authenticated_user&)`.
Refs #13245
Closes #13359
* github.com:scylladb/scylladb:
auth: drop operator<<(ostream, authenticated_user)
cql3: do not use operator<< to print authenticated_user
auth: specialize fmt::formatter<authenticated_user>
it is replaced by the ctor of updateable_timeout_config, so it does not
have any callers now. let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* timeout_config
- add `updated_timeout_config`, which represents always-updated
options backed by `utils::updateable_value<>`. this class is
used by servers which need access to the latest timeout-related
options. the existing `timeout_config` is more like a snapshot
of the `updated_timeout_config`; it is used in the cases where
we don't need the most up-to-date options or we update the options
manually on demand.
* redis, thrift, transport: s/timeout_config/updated_timeout_config/
when appropriate. use the improved version of timeout_config where
we need access to the most up-to-date version of the timeout
options.
Fixes #10172
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this change, the following query timeout config options are marked
live update-able:
- range_request_timeout_in_ms
- read_request_timeout_in_ms
- counter_write_request_timeout_in_ms
- cas_contention_timeout_in_ms
- truncate_request_timeout_in_ms
- write_request_timeout_in_ms
- request_timeout_in_ms
as per https://github.com/scylladb/scylladb/issues/10172,
> Many users would like to set the driver timers based on server timers.
> For example: expire a read timeout before or after the server read time
> out.
with this change, these options are *marked* live-updateable, but since
they are cached by their consumers locally, we will have another commit
to update the local copies when these options get updated.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this function returns a const reference to a member variable, so we
can mark it with the `const` specifier for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`timeout_config` is not used by auth/common.hh. presumably, this
class is not a public interface exposed by auth, as it is not
inherently related to auth. timeout_config is a shared setting across
related services, specifically redis_server, thrift and cql_server.
so, in this change, let's drop this forward declaration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that `redis_server::connection::timeout_config()` and
`redis_server::timeout_config()` are used nowhere, let's drop them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change keeps the timeout settings of handler_factory sync'ed
with the ones used by `thrift_server`. so far, the `timeout_config`
instance in `thrift_server` is not live-updateable, but a follow-up
change will make it so. so, this change prepares the handler_factory
for a live-updateable timeout_config.
instead of keeping a snapshot of the timeout_config, keep a reference to
it in handler_factory. the reference points to `thrift_server::_config`.
so even though `thrift_server::_handler_factory` is a shared_ptr,
the member variable won't outlive its container, as the only reason to
have it as a shared_ptr is to appease the ctor of
`CassandraAsyncProcessorFactory`. and the constructed
`_processor_factory` is also a member variable of `thrift_server`, so we
won't take the risk of a dangling reference held by `handler_factory`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of copying the `config` parameter, move away from it.
this change also prepares for a non-copyable config: if the class
of `config` is not copyable, we will not be able to initialize
the member variable by copying from the given `config` parameter.
after the live-updateable config change, the `_config` member
variable will contain instances of utils::observer<>, which is
not copyable but is move-constructible, hence in this change
we just move away from the given `config`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* pass config via sharded_parameter
* initialize config using designated initializer
this change paves the road to servers with live-updateable timeout
options.
before this change, the servers initialize a domain-specific combo
config, like `redis_server_config`, with the same instance of a
timeout_config, and pass the combo config as a ctor parameter to
construct each sharded service instance. but this design assumes
value semantics for the config class, i.e. it should be copyable.
but if we want to use utils::updateable_value<> to get updated
option values, we have to postpone the instantiation of the
config until the sharded service is about to be initialized.
so, in this change, instead of taking a domain-specific config created
beforehand, all services constructed with a `timeout_config` take
a `sharded_parameter()` for creating the config. also, take
this opportunity to initialize the config using designated initializers.
for two reasons:
* less repetition this way. we don't have to repeat the variable
name of the config being initialized for each member variable.
* prepare for some member variables which do not have a default
constructor. this applies to the timeout_config's updater, which
will not have a default constructor, as it should be initialized
from db::config and a reference to the timeout_config to be updated.
we will update the `timeout_config` side in a follow-up commit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to disambiguate `fmt::format_to()` from `std::format_to()`. it turns out
we have `using namespace std` somewhere in the source tree, and with
the libstdc++ shipped with GCC-13 we have `std::format_to()`, so without
specifying exactly which one to use, the compiler complains like
```
/optimized_clang/stage-1-X86/build/bin/clang++ -MD -MT build/dev/mutation/mutation.o -MF build/dev/mutation/mutation.o.d -I/optimized_clang/scylla-X86/seastar/include -I/optimized_clang/scylla-X86/build/dev/seastar/gen/include -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -I/usr/include/p11-kit-1 -ffile-prefix-map=/optimized_clang/scylla-X86=. -march=westmere -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_BUILD_MODE=dev -iquote. -iquote build/dev/gen --std=gnu++20 -ffile-prefix-map=/optimized_clang/scylla-X86=. -march=westmere -DBOOST_TEST_DYN_LINK -DNOMINMAX -DNOMINMAX -fvisibility=hidden -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-missing-braces -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-unsupported-friend -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-psabi -Wno-narrowing -Wno-nonnull -Wno-uninitialized -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DFMT_DEPRECATED_OSTREAM -c -o build/dev/mutation/mutation.o mutation/mutation.cc
In file included from mutation/mutation.cc:9:
In file included from mutation/mutation.hh:13:
In file included from mutation/mutation_partition.hh:21:
In file included from ./schema/schema_fwd.hh:13:
In file included from ./utils/UUID.hh:22:
./bytes.hh:116:21: error: call to 'format_to' is ambiguous
format_to(out, "{}{:02x}", _delimiter, std::byte(v[i]));
^~~~~~~~~
./bytes.hh:134:43: note: in instantiation of function template specialization 'fmt::formatter<fmt_hex>::format<fmt::basic_format_context<fmt::appender, char>>' requested here
return fmt::formatter<::fmt_hex>::format(::fmt_hex(bytes_view(s)), ctx);
^
/usr/include/fmt/core.h:813:64: note: in instantiation of function template specialization 'fmt::formatter<seastar::basic_sstring<signed char, unsigned int, 31, false>>::format<fmt::basic_format_context<fmt::appender, char>>' requested here
-> decltype(typename Context::template formatter_type<T>().format(
^
/usr/include/fmt/core.h:824:10: note: while substituting deduced template arguments into function template 'has_const_formatter_impl' [with Context = fmt::basic_format_context<fmt::appender, char>, T = seastar::basic_sstring<signed char, unsigned int, 31, false>]
return has_const_formatter_impl<Context>(static_cast<T*>(nullptr));
```
to address this FTBFS, let's be more explicit by adding "fmt::" to
specify which `format_to()` to use.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13361
The latter is the place where mutate_MV is called and it needs the
view updates generator nearby.
The call-stack starts at database::do_apply(). As was described in one
of the previous patches, applying mutations that need updating views
happens late enough, so if the view updates generator is not plugged into
the database yet, it's OK to bail out with an exception. If it's plugged,
it's carried over thus keeping the generator instance alive and waited
for on its stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is another mutations consumer that pushes view updates forward and
thus also needs the view updates generator pointer. It gets one from the
view builder that already has the dependency on generator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The consumer is in fact pushing the updates and _that_'s the component
that would really need the view_update_generator at hand. The consumer
is created from the generator itself, so there's no trouble getting the pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database is a low-level service, and currently the view update
generator implicitly depends on it via the storage proxy. However, the
database does need to push view updates with the help of the mutate_MV
helper, which would create a dependency loop.
This patch exploits the fact that view updates start being pushed late
enough; by that time all other services, including the proxy and the view
update generator, seem to be up and running. This allows a "weak
dependency" from the database to the view update generator, like the one
that already exists from the database to the system keyspace.
So in this patch the v.u.g. puts the shared-from-this pointer onto the
database at the time it starts. On stop, it removes this pointer after
the database is drained and (hopefully) all view updates have been pushed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The builder will need generator for view_builder::consumer in one of the
next patches.
The builder is a standalone service that starts among the last, and no
other services need the builder as a dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The generator will be responsible for spreading view updates with the
help of mutate_MV helper. The latter needs storage proxy to operate, so
the generator gets this dependency in advance.
There's no need to change start/stop order at the moment, generator
already starts after and stops before proxy. Also, services that have
generator as dependency are not required by proxy (even indirectly) so
no circular dependency is produced at this point.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this helps use OpenJDK 11 instead of OpenJDK 8 for running scylla-jmx,
in the hope of alleviating the crashes found in the JRE shipped with
OpenJDK 8, which is aging and now receives only security fixes.
* tools/jmx 88d9bdc...48e1699 (3):
> Merge 'dist/redhat: support jre 11 instead of jre 8' from Kefu Chai
> install.sh: point java to /usr/bin/java
> Merge 'use OpenJDK 11 instead of OpenJDK 8' from Kefu Chai
Refs https://github.com/scylladb/scylla-jmx/issues/194
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13356
this change removes the last two callers of `operator<<(ostream&, const composite_view&)`,
paving the road to removing this operator.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `composite` with the help of fmt::ostream.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `composite::composite_view` with the help of fmt::ostream.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `composite::component_view` with the help of fmt::ostream.
in this change, '#' is used to add the 0x prefix, as fmtlib allows adding
the '0x' prefix with the '#' format specifier when printing numbers with 'x'
as the type specifier. see https://fmt.dev/latest/syntax.html
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
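fmtlib's format specification mini-language follows Python's, so the effect of '#' combined with the 'x' type specifier can be illustrated in Python (the values here are only illustrative; `fmt::format("{:#04x}", 10)` in C++ behaves the same way):

```python
# '#' requests the alternate form (the 0x prefix) and 'x' selects
# lowercase hex; '04'/'06' zero-pad to the total width, prefix included.
print(f"{10:#04x}")   # 0x0a
print(f"{255:#06x}")  # 0x00ff
```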
this change removes the last two callers of `operator<<(ostream&, const authenticated_user&)`,
paving the road to removing this operator.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `auth::authenticated_user` with the help of fmt::ostream.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Now that stateless_aggregate_function is directly exposed by
aggregate_function, we can use it directly, avoiding the intermediary
aggregate_function::aggregate, which is removed.
Now that all aggregate functions are derived from
stateless_aggregate_function_adapter, we can just fold its functionality
into the base class. This exposes stateless_aggregate_function to
all users of aggregate_function, so they can begin to benefit from
the transformation, though this patch doesn't touch those users.
The aggregate_function base class is partially devirtualized since
there is just a single implementation now.
The accumulator_for template is used to select the accumulator
type for aggregates. After refactoring, all that is needed from
it is to select the native type, so remove all the excess code.
Currently, we use __int128, but this has no direct counterpart
in CQL, so we can't express the accumulator type as part of a
CQL scalar function. Switch to varint which is a superset, although
slower.
Currently it works without this, but later unreversing will
be removed from another part of the stack, causing min/max
on reversed types to return incorrect results. Anticipate that by
unreversing the types during construction.
In an incoming change, the wasm instance cache will be modified to be owned
by the query_processor - it will hold an optional instead of a raw
pointer to the cache, so we should stop returning the raw pointer
from the getter as well.
Consequently, the cache is also stored as a reference in wasm::cache,
as it gets the reference from the query_processor.
For consistency with the wasm engine and the wasm alien thread runner,
the name of the getter is also modified to follow the same pattern.
The wasm engine is used for compiling and executing Wasm UDFs, so
the query_processor is a more appropriate location for it than
replica::database, especially because the wasm instance cache
and the wasm alien thread runner are already there.
This patch also reduces the number of wasm engines to 1, shared by
all shards, as recommended by the wasmtime developers.
Fixes #13332
The tests use the discriminator "system" as a prefix to assume keyspaces are marked
"internal" inside scylla. This is not true in the enterprise universe (replicated key
provider). It maybe/probably should be, but that train is sailing right now.
Fix by removing one assert (which was not correct) and using actual API info in the
alternator test.
Closes #13333
TWCS no longer keeps internal state, and will now rely on state
managed by each compaction group through compaction::table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
LCS no longer keeps internal state, and will now rely on state
managed by each compaction group through compaction::table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will allow compaction_strategy to access the compaction group state
through compaction::table_state, which is the interface at which replica
talks to the compaction layer.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this is the 15th changeset of a series which tries to give an overhaul to the CMake building system. this series has two goals:
- to enable developers to use CMake for building scylla, so they can use tools with CMake integration (CLion, for instance) for a better developer experience
- to enable us to tweak the dependencies in a simpler way. a well-defined cross module / subsystem dependency graph is a prerequisite for building this project with C++20 modules.
this changeset includes the following changes:
- build: cmake: add two missing tests
- build: cmake: port more cxxflags from configure.py
Closes #13262
* github.com:scylladb/scylladb:
build: cmake: add missing source files to idl and service
build: cmake: port more cxxflags from configure.py
build: cmake: add two missing tests
The concept is needed by enterprise functionality, but in the hunt for globals this sticks out and should be removed.
This is also partially prompted by the need to handle the keyspaces in the above set specially on shutdown as well as startup. I.e., we need to ensure all user keyspaces are flushed/closed earlier than these; i.e., treat them as "system" keyspaces for this purpose.
These changes add an "extension internal" keyspace set instead, which for now (until enterprise branches are updated) also includes the "load_prio" set. They also change the distributed loader to use the extension API interface instead, and add shutdown special treatment to replica::database.
Closes #13335
* github.com:scylladb/scylladb:
database: Flush/close "extension internal" keyspaces after other user ks
distributed_loader: Use extensions set of "extension internal" keyspaces
db::extensions: Add "extensions internal" keyspace set
we've been seeing errors like
```
10:39:36 gdb-add-index: [Was there no debuginfo? Was there already an index?]
10:39:36 readelf: /jenkins/workspace/scylla-master/next/scylla/build/dist/debug/redhat/BUILDROOT/scylla-5.3.0~dev-0.20230321.0f97d464d32b.x86_64/usr/lib/debug/opt/scylladb/libreloc/libc.so.6-5.3.0~dev-0.20230321.0f97d464d32b.x86_64.debug: Error: Unable to find program interpreter name
```
when strip.sh is processing *.debug elf images. this is caused by a
known issue, see
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1012107 , and this
error is not fatal, but it is very distracting when we are trying to
find real errors in the jenkins log messages.
so, in this change, the stderr output from readelf is muted for a higher
signal-to-noise ratio in the build log messages.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13267
This is used for the readiness API (/storage_service/rpc_server), and the fix prevents it from returning 'true' prematurely.
Some improvement for readiness was added in a51529dd15, but the thrift implementation wasn't fully done.
Fixes https://github.com/scylladb/scylladb/issues/12376
Closes #13319
* github.com:scylladb/scylladb:
thrift: return address in listen_addresses() only after server is ready
thrift: simplify do_start_server() with seastar:async
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `bytes` and `gms::inet_address` without using ostream<<. also, this change removes all existing callers of `operator<<(ostream, const bytes &)` and `operator<<(ostream, const gms::inet_address&)`.
`gms::inet_address` related changes are included here in the hope of demonstrating the usage of the delimiter specifier of `fmt_hex`'s formatter.
Refs #13245
Closes #13275
* github.com:scylladb/scylladb:
gms/inet_address: implement operator<< using fmt::formatter
treewide: use fmtlib to format gms::inet_address
gms/inet_address: specialize fmt::formatter<gms::inet_address>
bytes: implement formatting helpers using formatter
bytes: specialize fmt::formatter<bytes>
bytes: specialize fmt::formatter<fmt_hex>
bytes: mark fmt_hex::v `const`
Files with the task manager repair module and related classes
are modified to be consistent with the task manager compaction module.
Closes #13231
* github.com:scylladb/scylladb:
repair: rename repair_module
repair: add repair namespace to repair/task_manager_module.hh
repair: rename repair_task.hh
The state is not wired anywhere yet. It will replace the ones
stored in compaction strategies themselves, thereby allowing
each compaction group to have its own state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
once compaction_strategy is made stateless, the state must be retrieved
in notify_completion() through table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is a step towards decoupling compaction strategy (computation)
and its state. Making the former stateless.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Refs #13334
Effectively treats keyspaces listed in "extension internal" as system keyspaces
w.r.t. shutdown/drain. This ensures all user keyspaces are fully flushed before
we disable these "internal" ones.
Refs #13334
To be populated early by extensions. Such a keyspace should be
1.) Started before user keyspaces
2.) Flushed/closed after user keyspaces
3.) In all other regards be considered "user".
* tools/cqlsh b9a606f...8769c4c (11):
> dist: redhat: provide only a single version
> pylib/setup, requirement.txt: remove Six
> setup: do not support python2
> install.sh: install files with correct permission in struct umask settings
> Remove unneed LC_ALL=en_US.UTF-8
> Support using other driver (datastax or older scylla ones)
> Fix RPM based downgrade command on scylla-cqlsh
> gitignore: ignore pylib/cqlshlib/__pycache__
> dist/redhat: add a proper changelog entry
> github actions: enable starting on tags
> Add support for building docker image
the goal of this change is to reduce the dependency on
`operator<<(ostream&, const gms::inet_address&)`.
this is not an exhaustive search-and-replace change: some call
sites have other dependencies on yet-to-be-converted ostream
printers, so we cannot fix them all. this change only updates some
callers of `operator<<(ostream&, const gms::inet_address&)`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `gms::inet_address` with the help of fmt::ostream.
please note, the ':' delimiter is specified when printing the IPv6 address.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
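The effect of the delimiter can be sketched with a small Python analogue (`fmt_hex` below is a stand-in illustrating the idea, not the actual C++ helper):

```python
def fmt_hex(data: bytes, delimiter: str = "") -> str:
    # hex-format each byte and join with the requested delimiter,
    # mirroring how the formatter emits _delimiter between bytes
    return delimiter.join(f"{b:02x}" for b in data)

# an IPv6-style rendering uses ':' as the delimiter
print(fmt_hex(b"\xfe\x80\x00\x01", ":"))  # fe:80:00:01
```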
some of these helpers print a byte array using `to_hex()`, which
materializes a string instance and then drops it on the floor after
printing it to the given ostream. this hurts performance, so
`fmt::print()` should be more performant than the
implementations based on `to_hex()`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print bytes with the help of fmt::ostream.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print bytes_view with the help of fmt::ostream. because fmtlib
has its own specialization for fmt::formatter<std::basic_string_view<T>>,
we cannot just create a full specialization for std::basic_string_view<int8_t>,
otherwise fmtlib would complain that
> Mixing character types is disallowed.
so we work around this using a delegate of fmt_hex.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `wait_for_normal_state_handled_on_boot` function waits until
`handle_state_normal` finishes for the given set of nodes. It was used
in `run_bootstrap_ops` and `run_replace_ops` to wait until NORMAL states
of existing nodes in the cluster are processed by the joining node
before continuing the joining process. One reason to do it is because at
the end of `handle_state_normal` the joining node might drop connections
to the NORMAL nodes in order to reestablish new connections using
correct encryption settings. In tests we observed that the connection
drop was happening in the middle of repair/streaming, causing
repair/streaming to abort.
Unfortunately, calling `wait_for_normal_state_handled_on_boot` in
`run_bootstrap_ops`/`run_replace_ops` is too late to fix all problems.
Before either of these two functions, we create a new CDC generation and
write the data to `system_distributed_everywhere.cdc_generation_descriptions_v2`.
In tests, the connections were sometimes dropped while this write was
in-flight. This would cause the write to never arrive at other nodes,
and the joining node would timeout waiting for confirmations.
To fix this, call `wait_for_normal_state_handled_on_boot` earlier in the
boot procedure, before `make_new_generation` call which does the write.
Fixes: #13302
Closes #13317
* github.com:scylladb/scylladb:
storage_service: wait for normal state handlers earlier in the boot procedure
storage_service: bootstrap: wait for normal tokens to arrive in all cases
storage_service: extract get_nodes_to_sync_with helper
storage_service: return unordered_set from get_ignore_dead_nodes_for_replace
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog.
Fixes: #12642
Closes #13134
* github.com:scylladb/scylladb:
test_raft_upgrade: add a test for schema commit log feature
scylla_cluster.py: add start flag to server_add
ServerInfo: drop host_id
scylla_cluster.py: add config to server_add
scylla_cluster.py: add expected_error to server_start
scylla_cluster.py: ScyllaServer.start, refactor error reporting
scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
raft: check if schema commitlog is initialized. Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set. For the upgrade procedure, the user should wait until the schema commitlog feature is enabled before enabling consistent_cluster_management.
raft: move raft initialization after init_system_keyspace
database: rename before_schema_keyspace_init->maybe_init_schema_commitlog
raft: use schema commitlog for raft tables
init_system_keyspace: refactoring towards explicit load phases
listen_addresses() checks if the _server variable is empty, and after this
patch we assign (move) the value only after the server is ready.
This is used for the readiness API (/storage_service/rpc_server), and the
fix prevents it from returning 'true' prematurely. Some improvement for
readiness was added in a51529dd15, but the thrift implementation
wasn't fully done.
Fixes #12376
* tools/python3 279b6c1...d2f57dd (3):
> dist: redhat: provide only a single version
> SCYLLA-VERSION-GEN: use -gt when comparing values
> SCYLLA-VERSION-GEN: remove unnecessary bashism
Task manager task implementations of classes that cover
cleanup keyspace compaction, which can be started through
the /storage_service/keyspace_compaction/ API.
Top level task covers the whole compaction and creates child
tasks on each shard.
Closes #12712
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test cleanup compaction
compaction: create task manager's task for cleanup keyspace compaction on one shard
compaction: create task manager's task for cleanup keyspace compaction
api: add get_table_ids to get table ids from table infos
compaction: create cleanup_compaction_task_impl
as fmt_hex is a helper class for formatting the underlying `bytes_view`,
it does not mutate it, so mark the member variable const and mark
the parameter in its constructor const. this change also helps us
use fmt_hex in use cases where const semantics are expected.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `range_tombstone` and `range_tombstone_change` without using ostream<<. also, this change removes all existing callers of `operator<<(ostream, const range_tombstone &)` and `operator<<(ostream, const range_tombstone_change &)`, and then removes these two `operator<<`s.
Refs #13245
Closes #13260
* github.com:scylladb/scylladb:
mutation: drop operator<<(ostream, const range_tombstone{_change,} &)
mutation: use fmtlib to print range_stombstone{_change,}
mutation: mutation_fragment_v2: specialize fmt::formatter<range_tombstone_change>
mutation: range_tombstone: specialize fmt::formatter<range_tombstone>
at least, we need to access the declarations of exceptions, like `not_a_leader` and `dropped_entry`. so, instead of relying on another header to do this job for us, we should include the header which includes the declarations; hence, in this change "raft.h" is included explicitly. also, include boost headers using `<path/to/header>` instead of `"path/to/header"` for more consistency.
Closes #13326
* github.com:scylladb/scylladb:
raft: include boost header using <path/to/header> not "path/to/header"
raft: include used header
* scripts/create-relocatable-package.py: add a command to print out
executables under libexec
* dist/debian/debian_files_gen.py: call create-relocatable-package.py
for a list of files under libexec and create source/include-binaries
with the list.
we repackage the precompiled binaries in the relocatable package into a debian source package using `./scylla/install.sh`, which edits the executables to use the specified dynamic library loader. but dpkg-source does not like this, as it wants to ensure that the files in the original tarball (*.orig.tar.gz) are identical to the files in the source package created by dpkg-source.
so we get the following failure when running reloc/build_deb.sh
```
dpkg-source: error: cannot represent change to scylla/libexec/scylla: binary file contents changed
dpkg-source: error: add scylla/libexec/scylla in debian/source/include-binaries if you want to store the modified binary in the debian tarball
dpkg-source: error: unrepresentable changes to source
dpkg-buildpackage: error: dpkg-source -b . subprocess returned exit status 1
debuild: fatal error at line 1182:
dpkg-buildpackage -rfakeroot -us -uc -ui failed
```
in this change, to address the build failure, as proposed by dpkg, the
path to the patched/edited executable is added to
`debian/source/include-binaries`. see the "Building" section in https://manpages.debian.org/bullseye/dpkg-dev/dpkg-source.1.en.html for more details.
please search `adjust_bin()` in `scylladb/install.sh` for more details.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12722
Create a local method called create_test_table that has the same
signature as test::create_test_table, but uses random schema behind the
scenes to generate the schema and the data, then migrate all the test
cases to use it instead.
To accommodate the added randomness of the random schema and
random data, the unreliable querier cache population checks were replaced
with more reliable lookup and miss checks, to prevent test flakiness.
Querier cache population checks worked well with a fixed and simple
schema and a fixed table population, they don't work that well with
random data.
With this, there are no more uses of test_table.hh in this test and the
include can be removed.
This keyspace exists by default and thus we don't have to create a new
one for each test. Also use `get_name()` to pass the test case's name as
table name, instead of hard-coding it. We already had some copy-pasta
creep in: two tests used the same table name. This is an error, as each
test runs in its own env, but it is confusing to see another test case's
name in the logs.
Propagate the page size to the result builder, so it can determine when
a page is short, and thus the last page, instead of asking for more
pages until an empty one turns up. This will make tests more reliable
when dealing with random datasets.
Also change how the page counter is bumped: bump it after the current
page is executed, at which point we know whether there will be a next
page or not. This fixes an off-by-one seen in some cases.
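The short-page termination rule described above can be sketched like this (Python; `fetch_page` and the other names are hypothetical, not Scylla's querier API):

```python
def read_all_rows(fetch_page, page_size):
    """Fetch pages until a short page signals the end of the result.

    A page shorter than page_size must be the last one, so we stop
    there instead of issuing one more request for an empty page.
    """
    rows = []
    while True:
        page = fetch_page(page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows
```

With 5 rows and a page size of 2, this issues 3 fetches rather than the 4 a wait-for-empty-page loop would need.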
Use the random_schema and its facilities to generate the schema and the
dataset. This allows the test to provide much better coverage than the
previous, fixed and simplistic schema did.
Also reduce the test table population and the number of scans run on it
so the test finishes in a more reasonable time frame.
We run these tests all the time due to CI, so no need to try to do too
much in a single run.
The tests in this file work with random schema and random data. Some
seeds can generate large partitions and rows, give the test some
more headroom to work with.
* generate lowercase names (upper-case seems to cause problems);
* preserve dependency order between UDTs when dumping them from schema;
* use built-in describe() to dump to CQL string;
* drop the single-arg dump_udts() overload, which was not recursive, unlike
the vector variant;
For regular and static columns, to introduce some further randomness.
So far frozen types were generated only for primary key members and
embedded types.
this series applies some random cleanups to bloom_filter. these cleanups are side products of the author's work on #13314.
Closes #13315
* github.com:scylladb/scylladb:
bloom_filter: mark internal help function static
bloom_filter: add more constness to false positive rate tables
bloom_filter: use vector::back() when appropriate
before this change, we used `round(random.random(), 5)` for
the value of the `bloom_filter_fp_chance` config option. there is a
chance that this expression returns a number lower than or equal
to 6.71e-05.
but we do have a minimum for this option, which is defined by
`utils::bloom_calculations::probs`, and the minimal false positive
rate is 6.71e-05.
we were observing test failures where we were using 0 for
the option, and scylla rightly rejected it with the error message
```
bloom_filter_fp_chance must be larger than 6.71e-05 and less than or equal to 1.0 (got 0)
```
so, in this change, to address the test failure, we always use a number
slightly greater than the minimum, to ensure that the randomly picked
number is in the range of supported false positive rates.
Fixes #13313
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13314
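One way to sketch the clamp (Python; the function name is illustrative, and the exact replacement value used upstream may differ):

```python
import random

MIN_FP_CHANCE = 6.71e-05  # minimum from utils::bloom_calculations::probs

def random_bloom_filter_fp_chance():
    # keep the 5-digit rounding, but never return a value at or below
    # the minimum, which scylla rejects
    value = round(random.random(), 5)
    return value if value > MIN_FP_CHANCE else 7e-05
```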
When creating the reader, the lifecycle policy might return one that was saved on the last page and survived in the cache. This reader might have skipped some fast-forwarding ranges while sitting in the cache. To avoid using a reader reading a stale range (from the read's POV), check its read range and fast forward it if necessary.
Fixes: https://github.com/scylladb/scylladb/issues/12916
Closes #12932
* github.com:scylladb/scylladb:
readers/multishard: shard_reader: fast-forward created reader to current range
readers/multishard: reader_lifecycle_policy: add get_read_range()
test/boost/multishard_mutation_query_test: paging: handle range becoming wrapping
The Wasm compilation is a slow, low priority task, so it should
not compete with reactor threads or the networking core.
To achieve that, we increase the niceness of the thread by 10.
An alternative solution would be to set the priority using
pthread_setschedparam, but it's not currently feasible,
because as long as we're using the SCHED_OTHER policy for our
threads, we cannot select any other priority than 0.
Closes #13307
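A minimal Python sketch of the approach (the function names are hypothetical; on Linux, raising the nice value of the calling thread is an unprivileged operation):

```python
import os
import threading

def compile_wasm_module(source):
    # stand-in for the slow, low-priority compilation work
    return len(source)

def compile_in_background(source, results):
    # raise niceness by 10 so the compilation competes less for CPU
    # with the latency-sensitive reactor threads
    os.nice(10)
    results.append(compile_wasm_module(source))

results = []
t = threading.Thread(target=compile_in_background, args=("(module)", results))
t.start()
t.join()
```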
Fixes https://github.com/scylladb/scylladb/issues/13106
This commit removes the information that BYPASS CACHE
is an Enterprise-only feature and replaces that info
with the link to the BYPASS CACHE description.
Closes #13316
<iterator> was introduced back in
1cf02cb9d8, but lexicographical_compare.hh
was extracted out in bdfc0aa748, since we
don't have any users of <iterator> in types.hh anymore, let's remove it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13327
s/%{version}/%{version}-%{release}/ in `Requires:` sections.
this enforces runtime dependencies on exactly the same release across scylla packages.
Fixes#13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13229
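Concretely, the change in the spec files looks like this (the package name below is illustrative):

```
# before: any release of the same upstream version satisfies the dependency
Requires: scylla-conf = %{version}
# after: the dependency is pinned to the exact same release
Requires: scylla-conf = %{version}-%{release}
```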
* github.com:scylladb/scylladb:
dist/redhat: split Requires section into multiple lines
dist/redhat: enforce dependency on %{release} also
min() and max() had two implementations: one static (for each type in
a select list) and one dynamic (for compound types). Since the
dynamic implementation is sufficient, we only reimplement that. This
means we don't use the automarshalling helpers, since we don't do any
arithmetic on values apart from comparison, which is conveniently
provided by abstract_type.
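The shape of the dynamic implementation can be sketched in Python, with the type's comparator passed in explicitly (names are illustrative, not Scylla's API; `type_less` stands in for abstract_type's ordering):

```python
def dynamic_max(values, type_less):
    # values stay opaque: the only operation needed is the ordering
    # supplied by the CQL type, no arithmetic on the values themselves
    best = None
    for v in values:
        if v is not None and (best is None or type_less(best, v)):
            best = v
    return best
```

min() is the same with the comparison flipped.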
Add a helper function to consolidate the internal native function
class and the automatic marshalling introduced in previous patches.
Since a lambda must be decayed into a function pointer (in order to
infer its signature), there are two overloads: one accepts a lambda
and decays it into a function pointer; the second accepts a function
pointer, infers its arguments, and constructs the function object.
at least, we need to access the declarations of exceptions, like
`not_a_leader` and `dropped_entry`. so, instead of relying on
another header to do this job for us, we should include the header
which includes the declarations; hence, in this change "raft.h" is
included explicitly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
It would have been better if `flush()` could have been called with a
keyspace and optional table param, but changing it now is too much
churn, so we add a dedicated method to flush a keyspace instead.
So far, the schema had to be provided via a schema.cql file, a file which
contains the CQL definition of the table. This is flexible but annoying
at the same time. Many times the sstables the tool operates on are located
in their table directory in a scylla data directory, where the schema
tables are also available. To mitigate this, an alternative method to
load the schema from memory was added which works for system tables.
In this commit we extend this to work for all kinds of tables: by
auto-detecting where the scylla data directory is, and loading the
schema tables from disk.
Allows loading the schema for the designated keyspace and table from
the system table sstables located on disk. The sstable files are opened
read-only.
When creating the reader, the lifecycle policy might return one that was
saved on the last page and survived in the cache. This reader might have
skipped some fast-forwarding ranges while sitting in the cache. To avoid
using a reader reading a stale range (from the read's POV), check its
read range and fast forward it if necessary.
After each page, the read range is adjusted so it continues from/after
the last read partition. Sometimes this can result in the range becoming
wrapped like this: (pk, pk]. In this case, we can just drop this range
and continue with the rest of the ranges (if there are multiple ones).
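The adjustment and the wrap check can be sketched like this (Python; representing ranges as (exclusive_start, inclusive_end) pairs is an assumption of this sketch, and the names are illustrative):

```python
def continue_after(last_pk, ranges):
    # resume the scan after the last read partition: shrink the first
    # range to (last_pk, end]; if that would produce the wrapping range
    # (pk, pk], drop it and continue with the remaining ranges
    (_start, end), rest = ranges[0], list(ranges[1:])
    if last_pk == end:
        return rest
    return [(last_pk, end)] + rest
```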
There was an attempt to cut the feature-service -> system-keyspace dependency (#13172), which turned out to require more changes. Here's a preparatory set squeezed out of that future work.
This set
- leaves only batch-enabling API in feature service
- keeps the need for async context in feature service
- narrows down system keyspace features API to only load and store records
- relaxes features updating logic in sys.ks.
- cosmetic
Closes #13264
* github.com:scylladb/scylladb:
feature_service: Indentation fix after previous patch
feature_service: Move async context into enable()
system_keyspace: Refactor local features load/save helpers
feature_service: Mark supported_feature_set() const
feature_service: Remove single feature enabling method
boot: Enable features in batch
gossiper: Enable features in batch
The test tries to start a node with
consistent_cluster_management but without
force_schema_commit_log. This is expected to fail,
since the schema commitlog feature should be enabled
by all the cluster nodes.
Sometimes when creating a node it's useful
to just install it and not start. For example,
we may want to try to start it later with
expected error.
The ScyllaServer.install method has been made
exception-safe: if an exception occurs, it
reverts to the original state. This allows
us to avoid duplicating the try/except logic
in two of its call sites.
We are going to allow the
ScyllaCluster.add_server function not to
start the server if the caller has requested
that with a special parameter. The host_id
can only be obtained from a running node, so
add_server won't be able to return it in
this case. I've grepped the tests for host_id
and there doesn't seem to be any
reference to it in the code.
Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.
Extract the function that encapsulates all the error
reporting logic. We are going to use it in several
other places to implement expected_error feature.
The ScyllaServer expects cmd to be None if the
Scylla process is not running. Otherwise, if start failed
and the test called update_config, the latter will
try to send a signal to a non-existent process via cmd.
Refuse to boot if neither the schema commitlog feature
nor force_schema_commit_log is set. For the upgrade
procedure the user should wait until
the schema commitlog feature is enabled before
enabling consistent_cluster_management.
Raft tables are loaded on the second call to
init_system_keyspace, so it seems more logical
to move initialization after it. This is not
necessary right now since raft tables are not used
in this initialization logic, but it may
change in the future and cause troubles.
We are going to move the raft tables from the first
load phase to the second. This means the second
init_system_keyspace call will load raft tables along
with the schema, making the name of this function imprecise.
We aim (#12642) to use the schema commit log
for raft tables. Now they are loaded at
the first call to init_system_keyspace in
main.cc, but the schema commitlog is only
initialized shortly before the second
call. This is important, since the schema
commitlog initialization
(database::before_schema_keyspace_init)
needs to access schema commitlog feature,
which is loaded from system.scylla_local
and therefore is only available after the
first init_system_keyspace call.
So the idea is to defer the loading of the raft tables
until the second call to init_system_keyspace,
just as it works for schema tables.
For this we need a tool to mark which tables
should be loaded in the first or second phase.
To do this, in this patch we introduce system_table_load_phase
enum. It's set in the schema_static_props for schema tables.
It replaces the system_keyspace::table_selector in the
signature of init_system_keyspace.
The call site for populate_keyspace in init_system_keyspace
was changed, table_selector.contains_keyspace was replaced with
db.local().has_keyspace. This check prevents calling
populate_keyspace(system_schema) on phase1, but allows for
populate_keyspace(system) on phase2 (to init raft tables).
On this second call some tables from system keyspace
(e.g. system.local) may have already been populated on phase1.
This check protects from double-populating them, since every
populated cf is marked as ready_for_writes.
The `wait_for_normal_state_handled_on_boot` function waits until
`handle_state_normal` finishes for the given set of nodes. It was used
in `run_bootstrap_ops` and `run_replace_ops` to wait until NORMAL states
of existing nodes in the cluster are processed by the joining node
before continuing the joining process. One reason to do it is because at
the end of `handle_state_normal` the joining node might drop connections
to the NORMAL nodes in order to reestablish new connections using
correct encryption settings. In tests we observed that the connection
drop was happening in the middle of repair/streaming, causing
repair/streaming to abort.
Unfortunately, calling `wait_for_normal_state_handled_on_boot` in
`run_bootstrap_ops`/`run_replace_ops` is too late to fix all problems.
Before either of these two functions, we create a new CDC generation and
write the data to `system_distributed_everywhere.cdc_generation_descriptions_v2`.
In tests, the connections were sometimes dropped while this write was
in-flight. This would cause the write to never arrive to other nodes,
and the joining node would timeout waiting for confirmations.
To fix this, call `wait_for_normal_state_handled_on_boot` earlier in the
boot procedure, before `make_new_generation` call which does the write.
Fixes: #13302
`storage_service::bootstrap` waits until it receives normal tokens of
other nodes before proceeding or it times out with an error. But it only
did that for bootstrap operation, not for replace operation. Do it for
replace as well.
This commit removes the Enterprise upgrade guides from
the Open Source documentation. The Enterprise upgrade guides
should only be available in the Enterprise documentation,
with the source files stored in scylla-enterprise.git.
In addition, this commit:
- adds the links to the Enterprise user guides in the Enterprise
documentation at https://enterprise.docs.scylladb.com/
- adds the redirections for the removed pages to avoid
breaking any links.
This commit must be reverted in scylla-enterprise.git.
Closes#13298
no need to use `size - 1` for accessing the last element in a vector;
let's just use `vector::back()` for more compact code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When propagating a view update to a paired view
replica fails, there is an error message.
This message is printed for every mutation,
which causes log spam when some node goes down.
This isn't a fatal error - it's normal that
a remote view replica goes down, it'll hopefully
receive the updates later through hints.
I'm unsure if the error message should
be printed at all, but for now we can
just rate limit it and that will improve
the situation with log spamming.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13175
The validate_column_family() tries to find a schema and throws if it
doesn't exist. The latter is determined by the exception thrown by the
database::find_schema(), but there's a throw-less way of doing it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13295
The patch series introduces linearisable topology changes using
raft protocol. The state machine driven by raft is described in
"service: Introduce topology state machine". Some explanations about
the implementation can be found in "storage_service: raft topology:
implement topology management through raft".
The code is not ready for production. There is not much in terms of error
handling and integration with the rest of the system is not even started.
For full integration request fencing will need to be implemented and
token_metadata has to be extended to support not just "pending" nodes
but concepts of "read replica set" and "write replica set".
The code may be far from being usable, but it is hidden behind the
"experimental raft" flag and having it in tree will relieve me from
constant rebase burden.
* 'raft-topology-v6' of github.com:scylladb/scylla-dev:
storage_service: fix indentation from previous patch
storage_service: raft topology: implement topology management through raft
service: raft: make group0_guard move assignable
service: raft: wire up apply() and snapshot transfer for topology in group0 state machine
storage_service: raft topology: introduce a function that applies topology cmd to local state machine
storage_service: raft topology: introduce a raft monitor and topology coordinator fibers
storage_service: raft topology: introduce snapshot transfer code for the topology table
raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicate with nodes
bootstrapper: Add get_random_bootstrap_tokens function
service: raft: add support for topology_change command into raft_group0_client
service: raft: introduce topology_change group0 command
system_keyspace: add a table to persist topology change state machine's state
service: Introduce topology state machine data structures
storage_proxy: not consult topology on local table write
The code here implements the state machine described in "service:
Introduce topology state machine". A topology operation is requested
by writing into topology_request field through raft. After that
topology_change_transition() function running on a leader is responsible
to drive the operation to completion. There is not much in terms of error
handling here yet. If something fails, the code will just keep trying.
topology_change_state_load() which is (eventually) called on all nodes each
time state machine's state changes is a glue between the raft view of
the topology and the rest of the "legacy" system. The code there creates
token_metadata object from the raft view and fills in peers table which
is needed for drivers. The gossiper is almost completely cut off from the
topology management, but the code still updates the node's state there to
'normal' and 'left' for some legacy functionality to continue working.
Note that handlers for those states are disabled in raft mode.
raft_topology_cmd_handler() is called by the topology coordinator and this
is where the streaming happens. The kind of streaming depends on the
state the node is in. The function is re-entrant: it can be called
more than once, and it will either start a new operation (if this is the
first invocation or the previous one failed) or wait for the previous
operation to complete.
The new code is hidden behind "experimental raft" and should not change
how the system works if disabled.
Some indentation here is intentionally left wrong and will be fixed by
the next patch.
The function applies the command to persistent storage and calls the stub
function topology_change_state_load(), which will load the new state into
memory in later patches.
The raft monitor fiber monitors the local raft server's state and starts
the topology coordinator fiber when the node becomes a leader, and stops
it when it is no longer a leader.
The coordinator fiber waits for topology state changes, but there will
be none yet.
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming,
or start to fence read/writes.
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.
The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.
Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.
Fixes#13239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13291
This reverts commit c6087cf3a0.
Said commit can cause a deadlock when 2 or more repairs compete for
locks on 2 or more nodes. Consider the following scenario:
Node n1 and n2 in the cluster, 1 shard per node, rf = 2, each shard has
1 available unit for the reader lock
n1 starts repair r1
r1-n1 (instance of r1 on node1) takes the reader lock on node1
n2 starts repair r2
r2-n2 (instance of r2 on node2) takes the reader lock on node2
r1-n2 will fail to take the reader lock on node2
r2-n1 will fail to take the reader lock on node1
As a result, r1 and r2 could not make progress and deadlock happens.
The complexity comes from the fact that a repair job needs locks on more
than one node. It is not guaranteed that all the participant nodes could
take the lock in one shot.
There is no simple solution to this, so we have to revert this locking
mechanism and look for another way to prevent reader thrashing when
repairing nodes with mismatching shard counts.
Fixes: #12693
Closes #13266
before this change, the line is 249 chars long, so split it into
multiple lines for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore.
in this change,
* alternator_timeout_in_ms is marked as live-updateable
* executor::_s_default_timeout is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now an updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so executor::_timeout_in_ms can
be initialized on per-shard basis
* executor::set_default_timeout() is dropped, as we already pass
the option to executor in its ctor.
please note, in the ctor of executor, we always update the cached
value of `s_default_timeout` with the value of `_timeout_in_ms`,
and we set the default timeout to 10s in `alternator_test_env`.
this is a design decision to avoid bending the production code for
testing: in production, we always set the timeout with the value
specified either by the default or by the yaml conf file.
Fixes#12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Cassandra detects when a batch has both an IF EXISTS and IF NOT EXISTS
on the same row, and complains this is not a useful request (after all,
it can never succeed, because the batch can only succeed if both conditions
are true, and that can't be if one checks IF EXISTS and the other
IF NOT EXISTS).
This patch adds a test, test_lwt_with_batch_conflict_1, which checks
that this case results in an error. It passes on Cassandra, but xfails
on Scylla which doesn't report an error in this case.
A second test, test_lwt_with_batch_conflict_2, shows that the detection
of the EXISTS / NOT EXISTS conflict is special, and other conflicts
such as having both "r=1" and "r=2" for the same row, are NOT detected
by Cassandra.
Refs #13011.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13270
It's declared in header, but is not used outside of .cc. Forward
declaration in header would be enough.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13289
for better readability.
also, add `#include <concepts>`, as we should include what we use
instead of relying on other headers do this on behalf of us.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13277
clang warns when the implicit conversion changes the precision of the
converted number. in this case, before being multiplied,
`std::numeric_limits<unsigned long>::max() >> 1` is implicitly
promoted to double so it can obtain the common type of double and
unsigned long. and the compiler warns:
```
/home/kefu/dev/scylladb/test/boost/network_topology_strategy_test.cc:129:84: error: implicit conversion from 'unsigned long' to 'double' changes value from 9223372036854775807 to 9223372036854775808 [-Werror,-Wimplicit-const-int-float-conversion]
return static_cast<unsigned long>(d*(std::numeric_limits<unsigned long>::max() >> 1)) << 1;
~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
```
but
1. we don't really care about the precision here, we just want to map a
double to a token represented by an int64_t
2. the maximum possible number being converted is less than
9223372036854775807, the maximum value of int64_t (in general an
alias of `long long`); and even if long were 32 bits
(LONG_MAX = 2147483647), after shifting right the result
would be 1073741823
so this is a false alarm. in order to silence it, we explicitly
cast the RHS of `*` operator to double.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13221
Hey y'all!
Me and @malusev998 are maintaining an updated version of the [PHP Driver](https://github.com/he4rt/scylladb-php-driver) together with the @he4rt community, and it has had a bunch of improvements over the last months.
Before, it worked only on PHP 7.1 (the DataStax branch); on our branch it works on PHP 8.1 and 8.2.
We are also using the ScyllaDB C++ Driver in this project, and I think it's a good idea to point new users to this project since it's the most up-to-date maintained PHP driver now.
What do y'all think about that?
Closes#13218
* github.com:scylladb/scylladb:
fix: links to php driver
fix: adding php versions into driver's description
docs: scylladb better php driver
Currently, we only tested whether permissions with UDFs
that have quoted names work correctly. This patch adds
the missing test that confirms that we can also use UDTs
(as UDF parameter types) when altering permissions.
Currently, the ut_name::to_string() is used only in 2 cases:
the first one is in logs or as part of error messages, and the
second one is during parsing, temporarily storing the user
defined type name in the auth::resource for later preparation
with database and data_dictionary context.
This patch changes the string so that the 'name' part of the
ut_name (as opposed to the 'keyspace' part) is now quoted when
needed. This does not hurt the logging use cases, but it
does help with parsing the resulting string when finishing
preparation of the auth::resource.
After the modification, a more fitting name for the function
is "ut_name::to_cql_string()", so the function is renamed to that.
While in debug mode, we may switch the default stack to
a larger one when parsing cql. We may, however, invoke
the parser recursively, causing us to switch to the big
stack while currently using it. After the reset, we
assume that the stack is empty, so after switching to
the same stack, we write over its previous contents.
This is fixed by checking if we're already using the large
stack, which is achieved by comparing the address of
a local variable to the start and end of the large stack.
this change is a leftover of 063b3be8a7, which failed to include the changes in the header files.
it turns out we have `using namespace httpd;` in seastar's `request_parser.rl`, and we should not rely on this statement to expose the symbols in `seastar::httpd` to the `seastar` namespace. in this change,
also, since `get_name()`, previously a non-static member function of `seastar_test`, is now a static member function, we need to update the tests that capture `this` to call it, so they don't capture `this` anymore.
Closes#13202
* github.com:scylladb/scylladb:
test: drop unused captured variables
Update seastar submodule
* seastar 9cbc1fe889...1204efbc5e (14):
> http: Add lost pragma once into client.hh
> prometheus, http: do not expose httpd::* in seastar
> build: add haswell support
> ci: fix configuration to build checkheaders target.
> core: map_reduce: Fix use-after-free in variant with futurized reducer
> Merge 'tests: support boost::test decorators and tolerate failures in test_spawn_input' from Kefu Chai
> memory: support reallocing foreign (non-Seastar) memory on a reactor thread
> test: futures: disable -Wself-move for GCC>=13
> map_reduce: do not move a temporary object
> doc/building-dpdk.md: drop extraneous '$'
> http: url_decode: translate plus back into char
> Merge 'seastar-json2code: cleanups' from Kefu Chai
> Fix markdown formatting
> Merge 'Minor abort on OOM changes' from Travis Downs
Use generation_type rather than generation_type::int_t
where possible and remove the deprecated
functions accepting the int_t.
Ref #10459
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Also, add a bunch of make_sstable variants that get a
generation_type param for this.
With that, the entry points for generation_type::int_t
are deprecated and their users will be converted
in following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is possible to find no generation in an empty
table directory, and in the future, with uuid generations
it'd be possible to find no numeric generations in the
directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
make_new_generation generates a new generation
from an optional one.
If disengaged, it just generates a new generation
based on the shard_id. Otherwise, it generates
the next generation in sequence by adding
smp::count to the previous value, like we do today.
In the future, with uuid-based generations, the
function could be used to generate a new random
uuid based on the optional parameter.
It will be up to the caller, e.g. replica::table or
sstables manager to decide which kind of generation to
create.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
updating the highest generation happens only during
startup, and creating sstables is done rarely enough that
there is no reason to inline either function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Convert all users to use sstables::generation_type::int_t.
Further patches will continue to convert most to
using sstables::generation_type instead so we can
abstract the value type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch extends a previous patch that added these metrics globally:
- cql_requests_count
- cql_request_bytes
- cql_response_bytes
This patch adds a "scheduling_group_name" label to these metrics and changes corresponding
counters to be accounted on a per-scheduling-group level.
As a bonus this patch also marks all 3 metrics as 'skip_when_empty'.
Ref #13061
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20230321201412.3004845-1-vladz@scylladb.com>
Notably, to admission execution and eviction. Registering/unregistering
the permit as inactive is not traced, as this happens on every
buffer-fill for range scans.
Semaphore trace messages have a "[reader_concurrency_semaphore]" prefix
to allow them to be clearly associated with the semaphore.
To make sure all tracing done on a certain page will make its way into
the appropriate trace session.
This is a continuation of the previous patch (which added a trace pointer
to the permit).
And propagate it down to where it is created. This will be used to add
trace points for semaphore related events, but this will come in the
next patches.
perf.cc has two key comparators: key_compare and key_tri_compare. These
are very generic names; in fact key_compare directly clashes with a
comparator with the same name in types.hh. Avoid the clash by renaming
both of these to a more unique name.
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectMultiColumnRelationTest.java into our
cql-pytest framework.
The tests reproduce four already-known Scylla bugs and three new bugs.
All tests pass on Cassandra. Because of these bugs 9 of the 22 tests
are marked xfail, and one is marked skip (it crashes Scylla).
Already known issues:
Refs #64: CQL Multi column restrictions are allowed only on a clustering
key prefix
Refs #4178: Not covered corner case for key prefix optimization in filtering
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #8627: Cleanly reject updates with indexed values where value > 64k
New issue discovered by these tests:
Refs #13217: Internal server error when null is used in multi-column relation
Refs #13241: Multi-column IN restriction with tuples of different lengths
crashes Scylla
Refs #13250: One-element multi-column restriction should be handled like a
single-column restriction
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13265
Point to the difference between the official MurmurHash3 and Scylla / Cassandra implementation
Update docs/glossary.rst
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
Closes#11369
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.
Move lock to the callers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13240
The topology state machine will track all the nodes in a cluster,
their state, properties (topology, tokens, etc) and requested actions.
A node's state can be one of these:
none - the node is not yet in the cluster
bootstrapping - the node is currently bootstrapping
decommissioning - the node is being decommissioned
removing - the node is being removed
replacing - the node is replacing another node
normal - the node is working normally
rebuild - the node is being rebuilt
left - the node has left the cluster
Nodes in state left are never removed from the state.
Tokens also can be in one of the states:
write_both_read_old - writes are going to new and old replica, but reads are from
old replicas still
write_both_read_new - writes still going to old and new replicas but reads are
from new replica
owner - tokens are owned by the node and reads and write go to new
replica set only
Tokens that need to be moved start in the 'write_both_read_old' state. After the
entire cluster learns about it, streaming starts. After streaming, the tokens move
to the 'write_both_read_new' state, and again the whole cluster needs to learn
about it and make sure no reads started before that point exist in the system.
After that, the tokens may move to the 'owner' state.
topology_request is the field through which a topology operation request
can be issued to a node. A request is one of the topology operations
currently supported: join, leave, replace or remove.
Writes to tables with local replication strategies do not need to consult
the topology. This is not only an optimization but it allows writing
into the local tables before topology is known.
this change is a leftover of 063b3be,
which failed to include the changes in the header files.
it turns out we have `using namespace httpd;` in seastar's
`request_parser.rl`, and we should not rely on this statement to
expose the symbols in `seastar::httpd` to the `seastar` namespace.
in this change,
* api/*.hh: all httpd symbols are referenced by `httpd::*`
instead of being referenced as if they are in `seastar`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
UUID_test uses lexicographical_compare from the types module. This
is a layering violation, since UUIDs are at a much lower level than
the database type system. In practical terms, this causes link failures
with gcc due to some thread-local-storage variables defined in types.hh
but not provided by any object, since we don't link with types.o in this
test.
Fix by extracting the relevant functions into a new header.
gcc thinks the constructor call is ambiguous since "{}" can match
the default constructor. Fix by making the parameter type explicit.
Use "{}" for the constructor call to avoid the most-vexing-parse
problem.
Some tests initialize via an initializer_list, but gcc finds other
valid constructors via vector<managed_bytes>. Disambiguate by adding
a constructor that accepts the initializer_list, and forward to the
wanted constructor.
serialize_listlike() is called with a range of either managed_bytes
or managed_bytes_opt. If the former, then iterating and assigning
to a loop induction variable of type managed_bytes_opt& will bind
the reference to a temporary managed_bytes_opt, which gcc dislikes.
Fix by performing the binding in a separate statement, which allows
for lifetime extension.
gcc dislikes a member name that matches a type name, as it changes
the type name retroactively. Fix by fully-qualifying the type name,
so it is not changed by the newly-introduced member.
We compare a signed variable to an unsigned one, which can
yield surprising results. In this case, it is harmless since
we already validated the signed input is positive, but
use std::cmp_less() to quell any doubts (and warnings).
There's a need to convert both -- version and format -- to string and back. Currently, there's a disparate set of helpers in the sstables/ code doing that, and this PR brings some order to it
- adds fmt::formatter<> specialization for both types
- leaves one set of {format|version}_from_string() helpers converting any string-ish object into value
refs: #12523
Closes #13214
* github.com:scylladb/scylladb:
sstables: Expell sstable_version_types from_string() helper
sstables: Generalize ..._from_string helpers
sstables: Implement fmt::formatter<sstable_format_types>
sstables: Implement fmt::formatter<sstable_version_types>
sstables: Move format maps to namespace scope
Callers don't need to know that enabling features has this requirement
Indentation is deliberately left broken (until next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Introduce load_local_enabled_features() and save_local_enabled_features()
that get and put std::set<sstring> with feature names (and perform set to
string and back conversions on their own). They look natural next to
existing sys.ks. methods to get/set local-supported features and peer
features.
Using the new API, the more generic functions to preserve individual
features and load them on startup can become much shorter and cleaner.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the future, when testing WASM UDFs, we will only store the Rust
source codes of them, and compile them to WASM. To be able to
do that, we need rust standard library for the wasm32-wasi target,
which is available as an RPM called rust-std-static-wasm32-wasi.
Closes#12896
[avi: regenerate toolchain]
Closes#13258
On boot main calls enable_features_on_startup() which at the end scans
through the list of features and enables them. Same as in previous patch
-- it makes sense to use batch enabling here.
Note that, although the loop that collects features is not as trivial as
in the previous patch (the gossiper case), it still operates on local copies
of feature sets, so delaying a feature's enabling doesn't affect other
features that need to be enabled too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiper code walks the list of feature names and enables them
one-by-one. However, in the feature_service code there's a method that
enables features in batch.
Using it now doesn't make any difference, but next patches will make
some use of it. Also, this will allow shortening feature_service's API and
will make it simpler to remove the qctx thing from there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Remove redundant "Total: ..." line.
Include the entire `reader_concurrency_semaphore::stats` in the printout. This includes a lot of metrics not exported to monitoring. These metrics are very valuable when debugging timeouts but are otherwise uninteresting. To avoid bloating our monitoring with such niche metrics, we dump them when they are interesting: when timeouts happen. To be really helpful, we do need historic values too, but this shouldn't be a problem: timeouts come in bursts, we usually get at least a handful of diagnostics dumps at a time.
New stats are also added to record the reason why reads are queued on the semaphore.
Printout before:
```
INFO 2023-03-14 12:43:54,496 [shard 0] reader_concurrency_semaphore - Semaphore test_reader_concurrency_semaphore_memory_limit_no_leaks with 4/4 count and 7168/4096 memory resources: kill limit triggered, dumping permit diagnostics:
permits count memory table/description/state
4 4 7K *.*/reader/active/unused
2 0 0B *.*/reader/waiting_for_admission
6 4 7K total
Total: 6 permits with 4 count and 7K memory resources
```
Printout after:
```
INFO 2023-03-16 04:23:41,791 [shard 0] reader_concurrency_semaphore - Semaphore test_reader_concurrency_semaphore_memory_limit_no_leaks with 3/4 count and 7168/4096 memory resources: kill limit triggered, dumping permit diagnostics:
permits count memory table/description/state
2 2 6K *.*/reader/active/unused
1 1 1K *.*/reader/waiting_for_memory
2 0 0B *.*/reader/waiting_for_admission
5 3 7K total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 1
reads_admitted: 4
reads_enqueued_for_admission: 4
reads_enqueued_for_memory: 5
reads_admitted_immediately: 2
reads_queued_because_ready_list: 0
reads_queued_because_used_permits: 0
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 4
reads_queued_with_eviction: 0
total_permits: 6
current_permits: 5
used_permits: 0
blocked_permits: 0
disk_reads: 0
sstables_read: 0
```
Closes#13173
* github.com:scylladb/scylladb:
test/boost/reader_concurrency_semaphore_test: remove redundant stats printouts
reader_concurrency_semaphore: do_dump_reader_permit_diagnostics(): print the stats
reader_concurrency_semaphore: add stats to record reason for queueing permits
reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
Its name is too generic despite its narrow specialization. Also,
there's a version_from_string() method that does the same in a more
convenient way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two string->{version|format} converters living on class
sstable. It's better to have both at namespace scope. Surprisingly,
there's only one caller of it.
This patch also makes both accept std::string_view so as not to limit the
helpers to converting only sstring&-s. This requires updating the
reverse_map template with "heterogeneous lookup".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
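A minimal sketch of the "heterogeneous lookup" idea mentioned above, using hypothetical names (the real reverse_map and version list differ): giving the map a transparent `std::less<>` comparator lets the converter look keys up by `std::string_view` without constructing a temporary string.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>

// Illustrative version enum; not the real sstables::sstable_version_types.
enum class sstable_version_sketch { ka, la, mc, md, me };

inline const std::map<std::string, sstable_version_sketch, std::less<>>&
version_map() {
    static const std::map<std::string, sstable_version_sketch, std::less<>> m = {
        {"ka", sstable_version_sketch::ka},
        {"la", sstable_version_sketch::la},
        {"mc", sstable_version_sketch::mc},
        {"md", sstable_version_sketch::md},
        {"me", sstable_version_sketch::me},
    };
    return m;
}

// Namespace-scope converter accepting std::string_view, not just sstring&.
inline sstable_version_sketch version_from_string_sketch(std::string_view s) {
    auto it = version_map().find(s); // heterogeneous find: no temporary key
    if (it == version_map().end()) {
        throw std::invalid_argument("unknown sstable version: " + std::string(s));
    }
    return it->second;
}
```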
This way the version type can be fed as-is into fmt:: code, and the
conversion to string is correspondingly as simple as fmt::to_string(v).
So also drop the existing explicit to_string() helper, updating all callers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It turns out that Cassandra handles a restriction like `(c2) = (1)` just
like `c2 = 1`, and is not limited like multi-column restrictions. In
particular, this query works despite missing "c1", and may also use an
index if c2 is indexed.
But currently in Scylla, `(c2) = (1)` is handled like a multi-column
restriction, so complains if c2 is not the first clustering key column,
and cannot use an index.
This patch adds several tests demonstrating this difference between
Scylla and Cassandra (#13250). The xfailing tests pass on Cassandra
but fail on Scylla.
Refs #13250
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13252
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print range_tombstone_change without using ostream<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print range_tombstone.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
s/%{version}/%{version}-%{release}/ in `Requires:` sections.
this enforces the runtime dependencies of exactly the same
releases between scylla packages.
Fixes#13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, when granting a permission on a function resource, we only check whether the function exists, regardless of whether it's a user-defined or a builtin function. We should not support altering permissions on builtin functions, so this patch adds a check confirming that the found function is not builtin.
Additionally, adjust the exception thrown when trying to alter a permission that does not apply to a given resource.
Closes#13184
* github.com:scylladb/scylladb:
cql: change exception type when granting incorrect permissions
cql: check if the function is builtin when granting permissions
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print UUID without using ostream<<. also, this change re-implements some formatting helpers using fmtlib for better performance and less dependencies on operator<<(), but we cannot drop it at this moment, as quite a few caller sites are still using operator<<(ostream&, const UUID&) and operator<<(ostream&, tagged_uuid<T>&). we will address them separately.
* add `fmt::formatter<UUID>`
* add `fmt::formatter<tagged_uuid<T>>`
* implement `UUID::to_string()` using `fmt::to_string()`
* implement `operator<<(std::ostream&, const UUID&)` with `fmt::print()`, this should help to improve the performance when printing uuid, as `fmt::print()` does not materialize a string when printing the uuid.
* treewide: use fmtlib when printing UUID
Refs #13245
Closes#13246
* github.com:scylladb/scylladb:
treewide: use fmtlib when printing UUID
utils: UUID: specialize fmt::formatter for UUID and tagged_uuid<>
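The point about `operator<<` not materializing a string can be sketched with a simplified stand-in for `utils::UUID` (plain iostreams standing in for fmtlib; the field layout and names here are illustrative): the streaming operator formats straight into the sink, and `to_string()` is just "format and collect", mirroring `fmt::to_string()`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical UUID stand-in: two 64-bit halves, as in the canonical layout.
struct uuid_sketch {
    uint64_t msb = 0, lsb = 0;
};

// Write the canonical 8-4-4-4-12 form directly into the stream, without
// building an intermediate std::string.
inline std::ostream& operator<<(std::ostream& os, const uuid_sketch& u) {
    char buf[37];
    std::snprintf(buf, sizeof(buf), "%08x-%04x-%04x-%04x-%012llx",
            (unsigned)(u.msb >> 32), (unsigned)((u.msb >> 16) & 0xffff),
            (unsigned)(u.msb & 0xffff), (unsigned)(u.lsb >> 48),
            (unsigned long long)(u.lsb & 0xffffffffffffULL));
    return os << buf;
}

// to_string() reuses the streaming path and only materializes on demand.
inline std::string to_string(const uuid_sketch& u) {
    std::ostringstream os;
    os << u;
    return os.str();
}
```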
To apply topology_change commands group0_state_machine needs to have an
access to the storage service to support topology changes over raft.
Message-Id: <20230316112801.1004602-10-gleb@scylladb.com>
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded, though, since the state of group0 will not be re-created from the
log on restart: the entries will have an outdated id and will be
skipped. As a result, the in-memory state machine state will not be restored.
This is not a problem now since schema state is restored outside of the raft
code.
Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
Add a function that allows waiting for a state change of a raft server.
It is useful for a user that wants to know when a node becomes/stops
being a leader.
Message-Id: <20230316112801.1004602-4-gleb@scylladb.com>
Make tick() and is_leader() part of the API. First is used externally
already and another will be used in following patches.
Message-Id: <20230316112801.1004602-3-gleb@scylladb.com>
The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping
was flaky. It tests the toppartition request, which unfortunately needs
to choose a sampling duration in advance, and we chose 1 second, which we
considered more than enough - and indeed typically even 1ms is enough!
But very rarely (we know of only one occurrence, in issue #13223) one
second is not enough.
Instead of increasing this 1 second and making this test even slower,
this patch takes a retry approach: the test starts with a 0.01-second
duration, and is then retried with increasing durations until it succeeds
or a 5-second duration is reached. This retry approach has two benefits:
1. It de-flakes the test (allowing a very slow test to take 5 seconds
instead of the 1 second which wasn't enough), and 2. At the same time it
makes a successful test much faster (it used to always take a full
second; now it takes 0.07 seconds on a dev build on my laptop).
A *failed* test may, in some cases, take 10 seconds after this patch
(although in some other cases, an error will be caught immediately),
but I consider this acceptable - this test should pass, after all,
and a failure indicates a regression and taking 10 seconds will be
the last of our worries in that case.
Fixes#13223.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13238
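The retry scheme described above can be sketched generically (in C++ for consistency with the rest of this log; the helper name, starting value, and bound are illustrative, the real test is Python): start with a tiny sampling duration and keep growing it until the probe succeeds or the upper bound is passed.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: retry a probe with a doubling duration, starting
// small so the common (fast) case finishes quickly, and bounding the
// duration so a genuinely broken case still terminates.
inline bool retry_with_growing_duration(
        const std::function<bool(double /*seconds*/)>& probe,
        double initial = 0.01, double maximum = 5.0) {
    for (double d = initial; d <= maximum; d *= 2) {
        if (probe(d)) {
            return true; // fast machines succeed on the first, short attempt
        }
    }
    return false; // even the longest duration was not enough
}
```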
as in main(), we use `stop_signal` to handle SIGINT and SIGTERM,
so when scylla receives a SIGTERM, the corresponding signal handler
could get called on any thread created by this program. so there
is a chance that the alien_runner thread could be chosen to run the
signal handler set up by `main()`, but that signal handler assumes
the availability of a Seastar reactor. unfortunately, we don't have
a Seastar reactor in an alien thread. the same applies to Seastar's
`thread_pool`, which handles the slow and blocking POSIX calls typically
used for interacting with files.
so, in this change, we use the same approach as Seastar's
`thread_pool::work()` -- just block all signals, so the alien threads
used by wasm for compiling UDFs won't handle the signals using the
handlers planted by `main()`.
Fixes#13228
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13233
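A minimal sketch of the signal-blocking approach (helper names are illustrative, not the Seastar API): block all signals in the spawning thread so helper threads inherit a full mask and can never be chosen to run `main()`'s handlers.

```cpp
#include <cassert>
#include <csignal>
#include <pthread.h>
#include <thread>

// Block every signal in the calling thread; threads spawned afterwards
// inherit this mask, so the kernel won't pick them for signal delivery.
inline void block_all_signals_in_this_thread() {
    sigset_t mask;
    sigfillset(&mask);
    pthread_sigmask(SIG_BLOCK, &mask, nullptr);
}

// Query-only helper: returns true if SIGTERM is blocked in this thread.
inline bool sigterm_blocked() {
    sigset_t current;
    sigemptyset(&current);
    pthread_sigmask(SIG_BLOCK, nullptr, &current); // nullptr set = no change
    return sigismember(&current, SIGTERM) == 1;
}
```

This mirrors the `thread_pool::work()` pattern the commit refers to: the mask is set once, before the worker threads are created.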
This series cleans up unit test in preparation for PR #12994.
Helpers are added (or reused) to not rely on specific sstable generation numbers where possible (other than loading reference sstables that are committed to the repo with given generation numbers), and to generate the sstables for tests easily, taking advantage of generation management in `sstable_test_env`, `table_for_tests`, or `replica::table` itself.
Closes#13242
* github.com:scylladb/scylladb:
test: add verify_mutation helpers.
test: add make_sstable_containing memtable
test: table_for_tests: add make_sstable function
test: sstable_test_env: add make_sst_factory methods
test: sstable_compaction_test: do not rely on specific generations
tests: use make_sstable defaults as much as possible
test: sstable_test_env: add make_table_for_tests
test: sstable_datafile_test: do not rely on specific sstable generations
test: sstable_test_env: add reusable_sst(shared_sstable)
sstable: expose get_storage function
test: mutation_reader_test: create_sstable: do not rely on specific generations
test: mutation_reader_test: do_test_clustering_order_merger_sstable_set: rely on test_env sstable generation
test: mutation_reader_test: combined_mutation_reader_test: define a local sst_factory function
test: mutation_reader_test: do not use tmpdir
test: use big format by default
test: sstable_compaction_test: use highest sstable version by default
test: test_env: make_db_config: set cfg host_id
test: sstable_datafile_test: fixup indentation
test: sstable_datafile_test: various tests: do_with_async
test: sstable_3_x_test: validate_read, sstable_assertions: get shared_sstable
test: sstable_3_x_test: compare_sstables: get shared_sstable
test: sstable_3_x_test: write_sstables: return shared_sstable
test: sstable_3_x_test: write, compare, validate_sstables: use env.tempdir
test: sstable_3_x_test: compacted_sstable_reader: do not reopen compacted_sst
test: lib: test_services: delete now unused stop_and_keep_alive
test: sstable_compaction_test: use deferred_stop to stop table_for_tests
test: sstable_compaction_test: compound_sstable_set_incremental_selector_test: do_with_async
test: sstable_compaction_test: sstable_needs_cleanup_test: do_with_async
test: sstable_compaction_test: leveled_05: fixup indentation
test: sstable_compaction_test: leveled_05: do_with_async
test: sstable_compaction_test: compact_02: do_with_async
test: sstable_compaction_test: compact_sstables: simplify variable allocation
test: sstable_compaction_test: compact_sstables: reindent
test: sstable_compaction_test: compact_sstables: use thread
test: sstable_compaction_test: sstable_rewrite: simplify variable allocation
test: sstable_compaction_test: sstable_rewrite: fixup indentation
test: sstable_compaction_test: sstable_rewrite: do_with_async
test: sstable_compaction_test: compact: fixup indentation
test: sstable_compaction_test: compact: complete conversion to async thread
test: sstable_compaction_test: compaction_manager_basic_test: rename generations to idx
This patch adds a reproducing test for issue #13241, where attempting
a SELECT restriction (b,c,d) IN ((1,2)) - where the tuple is shorter
than needed - crashes Scylla (with a segmentation fault) instead of
generating a clean error as it should (and as done in Cassandra).
The test also demonstrates that if the tuple is longer than needed
(instead of shorter), the behavior is correct, and it is also
correct if "=" is used instead of IN. Only the combination of IN
and a too-short tuple seems to be broken - but broken in a bad way
(it can be used to crash Scylla).
Because the test crashes Scylla when it fails, it is marked "skip".
Refs #13241
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13244
Fixes https://github.com/scylladb/scylladb/issues/12758
This commit adds a new page with a matrix that shows
on which ScyllaDB Open Source versions we based given
ScyllaDB Enterprise versions.
The new file is added to the newly created Reference
section.
Closes#13230
this change tries to reduce the number of callers using operator<<()
for printing UUID. they are found by compiling the tree after commenting
out `operator<<(std::ostream& out, const UUID& uuid)`. but this change
alone is not enough to drop all callers, as some callers are using
`operator<<(ostream&, const unordered_map&)` and other overloads to
print ranges whose elements contain UUID. so in order to limit the
scope of the change, we are not changing them here.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print UUID without using ostream<<. also, this change reimplements
some formatting helpers using fmtlib for better performance and less
dependencies on operator<<(), but we cannot drop it at this moment,
as quite a few caller sites are still using operator<<(ostream&, const UUID&)
and operator<<(ostream&, tagged_uuid<T>&). we will address them separately.
* add fmt::formatter<UUID>
* add fmt::formatter<tagged_uuid<T>>
* implement UUID::to_string() using fmt::to_string()
* implement operator<<(std::ostream&, const UUID&) with fmt::print(),
this should help to improve the performance when printing uuid, as
fmt::print() does not materialize a string when printing the uuid.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is the 14th changeset of a series which tries to give an overhaul to the CMake building system. this series has two goals:
- to enable developers to use CMake for building scylla, so they can use tools (CLion, for instance) with CMake integration for a better developer experience
- to enable us to tweak the dependencies in a simpler way. a well-defined cross-module / subsystem dependency is a prerequisite for building this project with the C++20 modules.
this changeset includes following changes:
- build: cmake: promote add_scylla_test() to test/
- build: cmake: add all tests
Closes#13220
* github.com:scylladb/scylladb:
build: cmake: add all tests
build: cmake: promote add_scylla_test() to test/
To make it possible to move the class member away, resetting it to be
empty at the same time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13208
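The "move away and reset to empty" pattern the commit enables can be sketched with `std::exchange` (type and member names here are illustrative): the caller gets the old value and the member is left default-constructed in a single expression.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical holder; _dir stands in for a movable class member.
struct holder_sketch {
    std::string _dir;

    // Move the member out and leave an empty value behind.
    std::string take_dir() {
        return std::exchange(_dir, {});
    }
};
```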
Some unit tests want to change the sstable::_dir on the fly. However,
the sstable::_dir is going away, so this needs yet another virtual call
on the storage driver.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13213
table_for_tests uses a sstables manager to generate sstables
and gets the new generation from
table.calculate_generation_for_new_table().
The version to use is either the highest supported or
an ad-hoc version passed to make_sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The tests extensively use a `std::function<shared_sstable()>`
to generate new sstables.
Rather than handcrafting them all over the place,
let sstable_test_env return such a factory given a schema
(and another entry point that also takes a version),
which uses the embedded generation_factory in the test_env
to generate new sstables with unique generations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to maintain static generation numbers in the test.
Let the sstable_test_env dispatch sstable generations automatically,
and use the generated sstables themselves for reference rather
than their generation numbers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a few goodies to sstable_test_env to extend
entry points with default params for make_sstable
and reusable_sst.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Wrap table_for_tests ctor to pass the env sstables_manager
as well as the temporary directory path, as this is the
most common use case, and in preparation for adding
a make_sstable method in table_for_tests.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to use specific generations in the test, just
rely on the ones sstable_test_env generates.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow generating a sstable object from an existing
sstable to get the directory, generation, and version
from it, rather than passing them to reusable_sst
from other sources - since the intention is
to get a new sstable object based on an existing
sstable that was generated by the test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than maintaining a running generation number,
use the default env.make_sstable(s) in sst_factory
and collect the expected generations from the resulting
shared sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For generating shared_sstables with increasing generations
(using the test_env make_sstable generations) and a given level.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to pass the big format explicitly as it's
set by default by make_sstable and is never overridden.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Tests should just generate the highest sstable version
available. There is no need to continue testing old versions,
in particular partially supported ones like "la".
Use also the default values for sstable::format_types, buffer_size,
etc. if there's no particular need to override them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass the test-generated shared_sstable to validate_read
and then to sstable_assertions so it can be used
for make_sstable version and generation params.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Just use the one we created during compaction
for verification so we won't have to rely on a particular
generation/version.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling cf.stop_and_keep_alive() before the test exits,
since it must be stopped also on failure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We already use test_env::do_with_async in this function
but we didn't take full advantage of it to simplify the
implementation.
Do that before further changes are made.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
test: sstable_compaction_test: compaction_manager_basic_test: rename generations to idx
The function used `calculate_generation_for_new_table` for
the sstables generation. The so-called `generations` are just used
to generate key indices.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.
When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.
Fixes: #13045
Closes#13068
* github.com:scylladb/scylladb:
repair: fix indentation
repair: continue user requested repair if no_such_column_family is thrown
repair: add find_column_family_if_exists function
The `storage_options` describes where sstables should be located. Currently the object reside on keyspace_metadata, but is thus not available at the place it's needed the most -- the `table::make_sstable()` call. This set converts keyspace_metadata::storage_opts to be lw-shared-ptr and shares the ptr with class table.
refs: #12523 (detached small change from large PR)
Closes#13212
* github.com:scylladb/scylladb:
table: Keep storage options lw-shared-ptr
keyspace_metadata: Make storage options lw-shared-ptr
Clang-17 warns when we try to delete a pointer to a class with virtual
function(s) but without marking its dtor virtual. in this change, we
mark the dtor of the base class of `table_listener` virtual to address
the warning.
we have another solution though -- to mark `table_listener` `final`, as we
don't destruct `table_listener` via a pointer to its base classes. but
it'd be much simpler to just mark the dtor of the base class with virtual
method(s) as virtual. it's more idiomatic this way, and less error-prone.
this change should silence the warning like:
```
In file included from /home/kefu/dev/scylladb/test/boost/data_listeners_test.cc:9:
In file included from /usr/include/boost/test/unit_test.hpp:18:
In file included from /usr/include/boost/test/test_tools.hpp:46:
In file included from /usr/include/boost/test/tools/old/impl.hpp:20:
In file included from /usr/include/boost/test/tools/assertion_result.hpp:21:
In file included from /usr/include/boost/shared_ptr.hpp:17:
In file included from /usr/include/boost/smart_ptr/shared_ptr.hpp:17:
In file included from /usr/include/boost/smart_ptr/detail/shared_count.hpp:27:
In file included from /usr/include/boost/smart_ptr/detail/sp_counted_impl.hpp:35:
In file included from /home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/memory:78:
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on non-final 'table_listener' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<table_listener>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/stl_construct.h:88:15: note: in instantiation of member function 'std::unique_ptr<table_listener>::~unique_ptr' requested here
__location->~_Tp();
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13198
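The one-line fix can be sketched with an illustrative listener hierarchy (names here are hypothetical, not the real `table_listener` base): once the base class with virtual functions also has a virtual destructor, deleting through a base pointer runs the derived destructor and the Clang-17 warning goes away.

```cpp
#include <cassert>
#include <memory>

// Base class with virtual functions: its dtor must be virtual so that
// deleting via a base (or unique_ptr-to-base) pointer is well defined.
struct listener_base {
    virtual void on_event() = 0;
    virtual ~listener_base() = default; // the one-line fix
};

struct counting_listener : listener_base {
    int* counter;
    explicit counting_listener(int* c) : counter(c) {}
    void on_event() override { ++*counter; }
    ~counting_listener() override { *counter = -1; } // observable destruction
};
```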
this change should address the FTBFS with Clang-17.
it turns out we are comparing a mutation with an
optimized_optional<mutation>. and Clang-17 does not want to convert the
LHS, which is a mutation, to optimized_optional<mutation> for performing
the comparison using operator==(const optimized_optional<mutation>&),
despite optimized_optional(const T& obj) not being marked explicit.
this is understandable.
so, in this change, instead of relying on the implicit conversion, we
just
* check if the optional actually holds a value
* and compare the value by dereferencing the optional.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13196
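The rewritten comparison can be sketched with `std::optional` standing in for `optimized_optional<mutation>` (names are illustrative): instead of relying on the plain value implicitly converting to the optional type, first check that the optional holds a value, then compare the dereferenced value.

```cpp
#include <cassert>
#include <optional>
#include <string>

// std::string stands in for mutation; the shape of the check is the point.
inline bool equals_opt(const std::string& expected,
        const std::optional<std::string>& actual) {
    // 1. check the optional actually holds a value
    // 2. compare by dereferencing, avoiding any implicit conversion of the LHS
    return actual.has_value() && *actual == expected;
}
```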
It will be needed by the S3 driver to parse multipart-upload messages from
the server.
refs: #12523
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13158
[avi: regenerate toolchain]
Closes#13192
this series intends to deprecate `::join()`, as it always materializes a range into a concrete string. but what we usually want is to print the elements of the given range to a stream, or to a seastar logger, which is backed by fmtlib. also, because fmtlib offers exactly the same set of features implemented by to_string.hh, this change would allow us to use fmtlib to replace to_string.hh, for better maintainability and potentially better performance, as fmtlib is lazily evaluated and claims to be performant under most circumstances.
Closes#13163
* github.com:scylladb/scylladb:
utils: to_string: move join to namespace utils
treewide: use fmt::join() when appropriate
row_cache: pass "const cache_entry" to operator<<
Even after last fixups, the documentation still had some issues with
compilation instructions in particular. I also ran a spelling and
grammar check on the text, and fixed issues found by it.
Closes#13206
Print the semaphore stats below the permit listing and remove the
currently redundant "Total: " line.
Some of the stats printed here are already exported as metrics, but
instead of trying to cherry-pick and risk some metrics falling through
the cracks, just print everything, there aren't that many anyway.
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in new stats, one for each reason a permit
can be queued.
* add a new test KIND "UNIT", which provides its own main()
* add all tests which were not included yet
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For compatibility with Cassandra, this patch changes the exception
type thrown when trying to alter a permission that is not applicable
on the given resource from an Invalid query to a Syntax exception.
Currently, when granting a permission on a function resource, we only
check whether the function exists, regardless of whether it's a
user-defined or a builtin function. We should not support altering
permissions on builtin functions, so this patch adds a check confirming
that the found function is not builtin.
Tables need to know which storage their sstables need to be located at,
so class table needs to have its own reference to the storage options. This
can be inherited from the keyspace metadata.
Tests sometimes create a table without a keyspace at hand. For those, use
default-initialized storage options (which means local storage).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today the storage options are embedded into the metadata object. In the
future the storage options will need to be somehow referenced by the
class table too. Using a plain reference doesn't look safe, so turn the
storage options into an lw-shared-ptr instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
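The ownership change can be sketched with `std::shared_ptr` standing in for Seastar's `lw_shared_ptr` (all names here are illustrative stand-ins, not the real classes): keyspace_metadata owns the storage options via a shared pointer, and the table holds the same pointer instead of a plain reference.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative stand-ins for storage_options / keyspace_metadata / table.
struct storage_options_sketch {
    std::string type = "LOCAL";
};

struct keyspace_metadata_sketch {
    std::shared_ptr<storage_options_sketch> storage =
            std::make_shared<storage_options_sketch>();
};

struct table_sketch {
    std::shared_ptr<storage_options_sketch> storage;
    // Share the pointer instead of copying the object or holding a
    // dangling-prone plain reference.
    explicit table_sketch(const keyspace_metadata_sketch& ks)
            : storage(ks.storage) {}
};
```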
`join` can easily be confused with boost::algorithm::join
so make it more visible that we're using scylla's
utils implementation.
Also, move `struct print_with_comma` to utils::internal.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
now that fmtlib provides fmt::join() (see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view),
there is no need to reinvent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().
as fmt::join() returns a join_view, this could improve the
performance under certain circumstances where the fully materialized
string is not needed.
please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.
some noteworthy things related to this change:
* unlike the existing `join()`, `fmt::join()` returns a view. so we
have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
for instance, a `const std::string*`. but operator<<() always prints
a typed pointer. so if we want to format a typed pointer, we either
need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
`operator<<(std::ostream& os, const column_definition* cd)`, so we
have to use a wrapper class of `maybe_column_definition` for printing
a pointer to `column_definition`. since the overload is only used
by the two overloads of
`statement_restrictions::add_single_column_parition_key_restriction()`,
the operator<< for `const column_definition*` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
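The lazy-view point above can be sketched without fmtlib (this is a hypothetical homebrew reimplementation of the idea, not fmt's actual internals): the join helper returns a small view object holding the range and separator, and elements are only written out when the view is actually formatted, unlike the old `::join()` which eagerly built a string.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// A tiny join_view-like type: holds references, materializes nothing.
template <typename Range>
struct join_view_sketch {
    const Range& range;
    const char* sep;
};

// Elements are streamed only here, at format time.
template <typename Range>
std::ostream& operator<<(std::ostream& os, const join_view_sketch<Range>& v) {
    const char* s = "";
    for (const auto& e : v.range) {
        os << s << e;
        s = v.sep;
    }
    return os;
}

template <typename Range>
join_view_sketch<Range> join_sketch(const Range& r, const char* sep) {
    return {r, sep};
}
```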
Cranelift-codegen 0.92.0 and wasmtime 5.0.0 have security issues
potentially allowing malicious UDFs to read some memory outside
the wasm sandbox. This patch updates them to versions 0.92.1
and 5.0.1 respectively, where the issues are fixed.
Fixes#13157
Closes#13171
This commit adds branch-5.2 to the list of branches
for which we want to build the docs. As a result,
version 5.2 will be added to the version selector.
NOTE: Version 5.2 will be marked as unstable and
an appropriate message will be shown to the user.
After 5.2 is released, branch-5.2 needs to be
moved from UNSTABLE_VERSIONS to LATEST_VERSION
(where it should replace branch-5.1).
Closes#13200
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when the timeout is reached.
Closes#12540
* github.com:scylladb/scylladb:
api: gossiper: fix alive nodes
gms, service: lock live endpoint copy
gms, service: live endpoint copy method
this is the 13th changeset of a series which tries to give an overhaul to the CMake building system. this series has two goals:
- to enable developers to use CMake for building scylla, so they can use tools (CLion, for instance) with CMake integration for a better developer experience
- to enable us to tweak the dependencies in a simpler way. a well-defined cross-module / subsystem dependency is a prerequisite for building this project with the C++20 modules.
this changeset includes following changes:
- build: cmake: increase per link job mem to 4GiB
- build: cmake: add missing sources to test-lib
- build: cmake: add more tests
- build: cmake: remove quotes in "include()" commands
- build: cmake: drop unnecessary linkages
Closes#13199
* github.com:scylladb/scylladb:
build: cmake: drop unnecessary linkages
build: cmake: remove quotes in "include()" commands
build: cmake: add more tests
build: cmake: add missing sources to test-lib
build: cmake: increase per link job mem to 4GiB
The translated Cassandra unit tests in cassandra_tests/validation/operations/
reproduced three bugs in GROUP BY's interaction with LIMIT and PER PARTITION
LIMIT - issues #5361, #5362, and #5363. Unfortunately, those test functions
are very long, and each test fails on all of these issues and a few more,
making it difficult to use these tests to verify when those bugs have
been fixed. In other words, ideally a patch for issue 5361 should un-xfail
some reproducing test for this issue - but all the existing tests will
continue to fail after fixing 5361, because of other remaining bugs.
So in this patch, I created a new test file test_group_by.py with my own
tests for the GROUP BY feature. I tried to explore the different
capabilities of the GROUP BY feature, its different success and error
paths, and how GROUP BY interacts with LIMIT and PER PARTITION LIMIT.
As usual, I created many small test functions and not one huge test
function, and as a result we now have 5 xfailing tests which each
reproduces one bug and when the bug is fixed, it will start to pass.
All tests added here pass on Cassandra.
Refs #5361
Refs #5362
Refs #5363
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13136
Fixes https://github.com/scylladb/scylladb/issues/13138
Fixes https://github.com/scylladb/scylladb/issues/13153
This PR:
- Fixes outdated information about the recommended OS. Since version 5.2, the recommended OS should be Ubuntu 22.04 because that OS is used for building the ScyllaDB image.
- Adds the OS support information for version 5.2.
This PR (both commits) needs to be backported to branch-5.2.
Closes#13188
* github.com:scylladb/scylladb:
doc: Add OS support for version 5.2
doc: Updates the recommended OS to be Ubuntu 22.04
lld is multi-threaded in some phases. based on observation, it could
spawn up to 16 threads for each link job, and each job could take
more than 3 GiB of memory in total. without this change, we can run
into OOM on a machine without abundant memory, so increase the
per-link-job mem accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
operator<<(..) does not mutate the cache_entry parameter passed to it.
also, without this change fmtlib is not able to format the given cache_entry
parameter, as the calling formatter has a "const" specifier.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a helper that, given a C++ function, deduces its argument types
and wraps the function in marshalling/unmarshalling code.
The native function expects non-null inputs, so an additional helper is
called to decide what to do if nulls are encountered. One such
helper is return_accumulator_on_null (since that's the default
behavior of aggregates), and the other is return_any_nonnull(),
useful for reductions.
Currently, big_decimal(0) will select the big_decimal(string_view)
constructor (via 0 -> const char* -> string_view conversions).
0 is important for initializing aggregates, so fix it ahead of
using it.
We'll need many scalar functions to implement aggregates in terms
of scalars, so we add an internal_scalar_function class to reduce
boilerplate. The new class proxies the scalar function into a
native noncopyable_function provided by the constructor.
Currently, aggregate functions are implemented in a stateful manner.
The accumulator is stored internally in an aggregate_function::aggregate,
requiring each query to instantiate new instances (see
aggregate_function_selector's constructor, and note how it's called
from selector::new_instance()).
This makes aggregates hard to use in expressions, since expressions
are stateless (with state only provided to evaluate()). To facilitate
migration towards stateless expressions, we define a
stateless_aggregate_function (modelled after user-defined aggregates,
which are already stateless). This new struct defines the aggregate
in terms of three scalar functions: one to aggregate a new input into
an accumulator (provided in the first parameter), one to finalize an
accumulator into a result, and one to reduce two accumulators for
parallelized aggregation.
An adapter of the new struct to the aggregate_function interface is
also provided, to allow for incremental migration in the following
patches.
Previously, we moved cql3::functions::function to the
db::functions namespace, since functions are a part of the
data dictionary, which is independent of cql3. We do the
same now for scalar_function, since we wish to make use
of it in a new db::functions::stateless_aggregate_function.
A stub remains in cql3/functions to avoid churn.
A read that requested memory and has to wait for it can be registered as inactive. This can happen for example if the memory request originated from a background I/O operation (a read-ahead maybe).
Handling this case is currently very difficult. What we want to do is evict such a read on-the-spot: the fact that there is a read waiting on memory means memory is in demand and so inactive reads should be evicted. To evict this reader, we'd first have to remove it from the memory wait list, which is almost impossible currently, because `expiring_fifo<>`, the type used for the wait list, doesn't allow for that. So in this PR we set out to make this possible first, by transforming all current queues to be intrusive lists of permits. Permits are already linked into an intrusive list, to allow for enumerating all existing permits. We use these existing hooks to link the permits into the appropriate queue, and back to `_permit_list` when they are not in any special queue. To make this possible we first have to make all lists store naked permits, moving all auxiliary data fields currently stored in wrappers like `entry` into the permit itself. With this, all queues and lists in the semaphore are intrusive lists, storing permits directly, which has the following implications:
* queues no longer take extra memory, as all of them are intrusive
* permits are completely self-sufficient w.r.t. queuing: code can queue or dequeue permits just with a reference to a permit at hand; no other wrapper, iterator, pointer, etc. is necessary.
* queues don't keep permits alive anymore; destroying a permit will automatically unlink it from the respective queue, although this might lead to use-after-free. Not a problem in practice; only one code-path (`reader_concurrency_semaphore::with_permit()`) had to be adjusted.
After all this extensive preparation, we can now handle the case of evicting a reader which is queued on memory.
Fixes: #12700
Closes #12777
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: handle reader blocked on memory becoming inactive
reader_concurrency_semaphore: move _permit_list next to the other lists
reader_permit: evict inactive read on timeout
reader_concurrency_semaphore: move inactive_read to .cc
reader_concurrency_semaphore: store permits in _inactive_reads
reader_concurrency_semaphore: inactive_read: de-inline more methods
reader_concurrency_semaphore: make _ready_list intrusive
reader_permit: add wait_for_execution state
reader_concurrency_semaphore: make wait lists intrusive
reader_concurrency_semaphore: move most wait_queue methods out-of-line
reader_concurrency_semaphore: store permits directly in queues
reader_permit: introduce (private) operator * and ->
reader_concurrency_semaphore: remove redundant waiters() member
reader_concurrency_semaphore: add waiters counter
reader_permit: use check_abort() for timeout
reader_concurrency_semaphore: maybe_dump_permit_diagnostics(): remove permit list param
reader_concurrency_semaphore: make foreach_permit() const
reader_permit: add get_schema() and get_op_name() accessors
reader_concurrency_semaphore: mark maybe_dump_permit_diagnostics as noexcept
This patch fixes 2 small issues with the Wasm UDF documentation that
recently got uploaded:
1. a link was unnecessarily wrapped in angle brackets
2. a link did not redirect to the correct page due to a missing ":doc:" tag
Closes#13193
Our end goal (#12642) is to mark raft tables to use
schema commitlog. There are two similar
cases in code right now - `with_null_sharder`
and `set_wait_for_sync_to_commitlog` `schema_builder`
methods. The problem is that if we need to
mark some new schema with one of these methods
we need to do this twice - first in
a method describing the schema
(e.g. `system_keyspace::raft()`) and second in the
function `create_table_from_mutations`, which is not
obvious and easy to forget.
`create_table_from_mutations` is called when schema object
is reconstructed from mutations, `with_null_sharder`
and `set_wait_for_sync_to_commitlog` must be called from it
since the schema properties they describe are
not included in the mutation representation of the schema.
This series proposes to distinguish between the schema
properties that get into mutations and those that do not.
The former are described with `schema_builder`, while for
the latter we introduce `schema_static_props` struct and
the `schema_builder::register_static_configurator` method.
This way we can formulate a rule once in the code about
which schemas should have a null sharder/be synced, and it will
be enforced in all cases.
Closes #13170
* github.com:scylladb/scylladb:
schema.hh: choose schema_commitlog based on schema_static_props flag
schema.hh: use schema_static_props for wait_for_sync_to_commitlog
schema.hh: introduce schema_static_props, use it for null_sharder
database.cc: drop ensure_populated and mark_as_populated
just came across this part of the code; as `maybe_yield()` is a wrapper
around "if should_yield(): yield()", we're better off using it for more
concise code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13107
Fixes https://github.com/scylladb/scylladb/issues/13138
This PR fixes the outdated information about the recommended
OS. Since version 5.2, the recommended OS should be Ubuntu 22.04
because that OS is used for building the ScyllaDB image.
This commit needs to be backported to branch-5.2.
On start scylla checks if the option is set. It's nowadays useless, as
it had been removed from seastar (see 9e34779c update)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13148
this is the 12th changeset of a series which tries to give an overhaul to the CMake building system. this series has two goals:
- to enable developers to use CMake for building scylla, so they can use tools (CLion for instance) with CMake integration for better developer experience
- to enable us to tweak the dependencies in a simpler way. a well-defined cross module / subsystem dependency is a prerequisite for building this project with the C++20 modules.
this changeset includes following changes:
- build: cmake: remove Seastar from the option name
- build: cmake: add missing sources in test-lib and utils
- build: cmake: do not include main.cc in scylla-main
- build: cmake: define SEASTAR_TESTING_MAIN for SEASTAR tests
- build: cmake: add more tests
Closes #13180
* github.com:scylladb/scylladb:
build: cmake: add more tests
build: cmake: define SEASTAR_TESTING_MAIN for SEASTAR tests
build: cmake: do not include main.cc in scylla-main
build: cmake: add missing sources in test-lib and utils
build: cmake: remove Seastar from the option name
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectGroupByTest.java into our cql-pytest
framework.
This test file contains only 8 separate test functions, but each of them
is very long, checking hundreds of different combinations of GROUP BY with
other things like LIMIT, ORDER BY, etc., so 6 out of the 7 tests fail on
Scylla on one of the bugs listed below - most of the tests actually fail
in multiple places due to multiple bugs. All tests pass on Cassandra.
The tests reproduce six already-known Scylla issues and one new issue:
Already known issues:
Refs #2060: Allow mixing token and partition key restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #12477: Combination of COUNT with GROUP BY is different from Cassandra
in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
A new issue discovered by these tests:
Refs #13109: Incorrect sort order when combining IN, GROUP BY and ORDER BY
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13126
In debug mode the timings are:
view_schema_test: 90 sec
cql_query_test: 170 sec
memtable_test: 2090 sec
cql_functions_test: 2591 sec
other tests that are in/out of this list are not that obvious, but the
former two apparently deserve being replaced with the latter two.
Timings for dev/release modes are not that horrible, but the "first pair
is notably smaller than the latter" relation also exists.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13142
Today, the SSTable generation provides a hint on which shard owns a
particular SSTable. That hint determines which shard will load the
SSTable into memory.
With upcoming UUID generation, we will no longer have this hint
embedded into the SSTable generation, meaning that SSTables will be
loaded at random shards. This is not good because shards will have
to reference memory from other shards to access the SSTable
metadata that was allocated elsewhere.
This patch changes sstable_directory to:
1) Use the generation value only to determine which shard will calculate
the owner shards for SSTables. Essentially works like a round-robin
distribution.
2) The shard assigned to compute the owners for an SSTable will do
so reading the minimum from disk; usually only the Scylla component is
needed.
3) Once that shard has finished computing the owners, it will forward
the SSTable to the shard that owns it.
4) Shards will later load SSTables locally that were forwarded to
them.
Closes #13114
* github.com:scylladb/scylladb:
sstables: sstable_directory: Load SSTable at the shard that actually owns it
sstables: sstable_directory: Give sstable_info_vector a more descriptive name
sstables: Allow owner shards to be computed for a partially loaded SSTable
sstables: Move SSTable loading to sstable_directory::sort_sstable()
sstables: Move sstable_directory::sort_sstable() to private interface
sstables: Restore indentation in sstable_directory::sort_sstable()
sstables: Coroutinize sstable_directory::sort_sstable()
sstables: sstable_directory: Extract sstable loading from process_descriptor()
sstables: sstable_directory: Separate private fields from methods
sstables: Coroutinize sstable_directory::process_descriptor
* test/boost: add more tests: all tests listed in test/boost/CMakeLists.txt
should build now.
* rust: add inc library, which is used for testing.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
change the option name to "LINK_MEM_PER_JOB" as this is not
a Seastar option, but a top-level project option.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Under some circumstances, service_level_controller renames service
levels for internal purposes. However, the per-service-level metrics
registered by storage_proxy keep the name seen at first registration
time. This sometimes leads to mislabeled metrics.
Fix that by re-registering the metrics after scheduling groups
are renamed.
Fixes scylladb/scylla-enterprise#2755
Closes#13174
This series handles errors when aborting node operations and prints them rather than letting them leak and be exposed to the user.
Also, it cleans up the node_ops logging formats when aborting different node ops
and adds more error logging around errors in the "worker" nodes.
Closes#12799
* github.com:scylladb/scylladb:
storage_service: node_ops_signal_abort: print a warning when signaling abort
storage_service: s/node_ops_singal_abort/node_ops_signal_abort/
storage_service: node_ops_abort: add log messages
storage_service: wire node_ops_ctl for node operations
storage_service: add node_ops_ctl class to formalize all node_ops flow
repair: node_ops_cmd_request: add print function
repair: do_decommission_removenode_with_repair: log ignore_nodes
repair: replace_with_repair: get ignore_nodes as unordered_set
gossiper: get_generation_for_nodes: get nodes as unordered_set
storage_service: don't let node_ops abort failures mask the real error
This patch finishes the refactoring. We introduce the
use_schema_commitlog flag in schema_static_props
and use it to choose the commitlog in
database::add_column_family. The only
configurator added declares what was originally in
database::add_column_family - all
tables from schema_tables keyspace
should use schema_commitlog.
This patch continues the refactoring, now we move
wait_for_sync_to_commitlog property from schema_builder to
schema_static_props.
The patch replaces schema_builder::set_wait_for_sync_to_commitlog
and is_extra_durable with two register_static_configurator,
one in system_keyspace and another in system_distributed_keyspace.
They correspond to the two parts of the original disjunction
in schema_tables::is_extra_durable.
A simplified, more direct version of "dependency injection":
the caller/initiator (main/cql_test_env) provides a set of
services it will eventually start. A configurable can remember
these and use them, at least after the "start" notification.
Closes #13037
Our goal (#12642) is to mark raft tables to use
schema commitlog. There are two similar
cases in code right now - with_null_sharder
and set_wait_for_sync_to_commitlog schema_builder
methods. The problem is that if we need to
mark some new schema with one of these methods
we need to do this twice - first in
a method describing the schema
(e.g. system_keyspace::raft()) and second in the
function create_table_from_mutations, which is not
obvious and easy to forget.
create_table_from_mutations is called when schema object
is reconstructed from mutations, with_null_sharder
and set_wait_for_sync_to_commitlog must be called from it
since the schema properties they describe are
not included in the mutation representation of the schema.
This patch proposes to distinguish between the schema
properties that get into mutations and those that do not.
The former are described with schema_builder, while for
the latter we introduce schema_static_props struct and
the schema_builder::register_static_configurator method.
This way we can formulate a rule once in the code about
which schemas should have a null sharder, and it will
be enforced in all cases.
Until now, the instructions on generating wasm files and using them
for Scylla UDFs were stored in docs/dev, so they were not visible
on the docs website. Now that the Rust helper library for UDFs
is ready, and we're inviting users to try it out, we should also
make the rest of the Wasm UDF documentation readily available
for the users.
Closes #13139
- build: cmake: remove test which does not exist yet
- build: cmake: document add_scylla_test()
- build: cmake: extract index, repair and data_dictionary out
- build: cmake: extract scylla-main out
- build: cmake: find Snappy before using it
- build: cmake: add missing linkages
- build: cmake: add missing sources to test-lib
- build: cmake: link sstables against libdeflate
- build: cmake: link Boost::regex against ICU::uc
Closes #13110
* github.com:scylladb/scylladb:
build: cmake: link Boost::regex against ICU::uc
build: cmake: link sstables against libdeflate
build: cmake: add missing sources to test-lib
build: cmake: add missing linkages
build: cmake: find Snappy before using it
build: cmake: extract scylla-main out
build: cmake: extract index, repair and data_dictionary out
build: cmake: document add_scylla_test()
build: cmake: remove test which does not exist yet
There was some logic to call mark_as_populated at
the appropriate places, but the _populated field
and the ensure_populated function were
not used by anyone.
The `database::stop` method sometimes hangs and it's always hard to spot where exactly it sleeps. A few more logging messages would make this much simpler.
refs: #13100
refs: #10941
Closes #13141
* github.com:scylladb/scylladb:
database: Increase verbosity of database::stop() method
large_data_handler: Increase verbosity on shutdown
large_data_handler: Coroutinize .stop() method
Today, the SSTable generation provides a hint on which shard owns a
particular SSTable. That hint determines which shard will load the
SSTable into memory.
With upcoming UUID generation, we will no longer have this hint
embedded into the SSTable generation, meaning that SSTables will be
loaded at random shards. This is not good because shards will have
to reference memory from other shards to access the SSTable
metadata that was allocated elsewhere.
This patch changes sstable_directory to:
1) Use the generation value only to determine which shard will calculate
the owner shards for SSTables. Essentially works like a round-robin
distribution.
2) The shard assigned to compute the owners for an SSTable will do
so reading the minimum from disk; usually only the Scylla component is
needed.
3) Once that shard has finished computing the owners, it will forward
the SSTable to the shard that owns it.
4) Shards will later load SSTables locally that were forwarded to
them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, owner shards can only be computed for a fully loaded SSTable.
For upcoming changes in the SSTable loader, we want to load the minimum
from disk to be able to compute the set of shards owning the SSTable.
If sharding metadata is available, it means we only need to read
TOC and Scylla components.
Otherwise, Summary must be read to provide first and last keys for
compute_shards_for_this_sstable() to operate on them instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The reason for this change is that we'll want to fully load the
SSTable only at the destination shard.
Later, sort_sstable() will calculate the set of owner shards for an
SSTable by only loading the scylla metadata file.
If it turns out that the SSTable belongs to current shard, then
we'll fully load the SSTable using the new and fresh
sstable_directory::load_sstable().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will make it easier for process_descriptor to process the SSTable
without having to fully load the SSTable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Following the expected coding convention. It's also somewhat
disturbing to see them mixed up.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When one of column families requested for repair does not exist, we should
repair all other requested column families.
no_such_column_family exception is caught and logged, and repair continues.
Kill said read's memory requests with std::bad_alloc and dequeue it from
the memory wait list, then evict it on the spot.
Now that `_inactive_reads` just stores permits, we can do this easily.
Add a member of type `inactive_read` to reader permit, and store permit
instances in `_inactive_reads`. This list is now just another intrusive
list the permit can be linked into, depending on its state.
Inactive read handles now just store a reader permit pointer.
Following the same scheme we used to make the wait lists intrusive.
Permits are added to the ready list intrusive list while waiting to be
executed and moved back to the _permit_list when de-queued from this
list.
We now use a condition variable for signaling when there are permits
ready to be executed.
This patch adds an Alternator test reproducing issue #6389 - that
concurrent TagResource and/or UntagResource operations were broken and
some of the concurrent modifications were lost.
The test has two threads: one repeatedly adds and removes a tag A, the
other adds and removes a tag B. After we add tag A, we expect tag A
to be there - but due to issue #6389 this modification was sometimes
lost when it raced with an operation on B.
This test consistently failed before issue #6389 was fixed, and passes
now after the issue was fixed by the previous patches. The bug reproduces
by chance, so it requires a fairly long loop (a few seconds) to be sure
it reproduces - so it is marked as a "veryslow" test and will not run in CI,
but can be used to manually reproduce this issue with:
test/alternator/run --runveryslow test_tag.py::test_concurrent_tag
Refs #6389.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patches introduced the function modify_tags() as a
safe version of update_tags(), and switched all uses of update_tags()
to use modify_tags().
So now that the unsafe update_tags() is no longer used, we can drop it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Allows a configurable to subscribe to life cycle notifications for scylla app.
I.e. do stuff on start/stop.
Also allow configurables in cql_test_env
v2:
* Fix camel casing
* Make callbacks future<> (should have been. mismerge?)
Closes #13035
Alternator modifies tags in three operations - TagResource, UntagResource
and UpdateTimeToLive (the latter uses a tag to store the TTL configuration).
All three operations were implemented by three separate steps:
1. Read the current tags.
2. Modify the tags according to the desired operation.
3. Write the modified tags back with update_tags().
This implementation was not safe for concurrent operations - some
modifications may be lost. We fix this in this patch by using the new
modify_tags() function introduced in the previous patch, which performs
all three steps under one lock so the tag operations are serialized and
correctly isolated.
Fixes #6389
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing utility function update_tags() for modifying tags in a
schema (used mainly by Alternator) is not safe for concurrent operations:
The function first reads the old tags, then modifies them and writes
them back. If two such calls happen concurrently, both calls may read
the same old tags, make different modifications, and then both write
the new tags, with one's write overwriting the other's.
So in this patch, we introduce a new utility function, modify_tags(),
to provide a concurrency-safe read-modify-write operation on tags.
The new function takes a modification function and calls the read,
modify and write steps together under a single lock. The new function
also takes a table name instead of a schema object - because we need
to read the schema under the lock, because it might have already been
changed by some other concurrent operation.
This patch only introduces the new function, it doesn't change any
code to use it yet, and doesn't remove the unsafe update_tags() function.
We'll do those things in the next patches.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A migration_manager holds a reference to a storage_proxy, and uses it
internally a lot - e.g., to gain access to the data_dictionary. Users
of migration_manager might also benefit from this storage_proxy - we
will see such a case in the next patches. So let's provide a getter
for the storage_proxy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When creating a deletion log for a bunch of sstables the code checks
that all sstables share the same "storage" by lexicographically
comparing their prefixes. That's not correct, as filesystem paths may
refer to the same directory even if not being equal.
So far that's been mostly OK, because paths manipulations were done in
simple forms without producing unequal paths. Patch 8a061bd8 (sstables,
code: Introduce and use change_state() call) triggered a corner case.
fs::path foo("/foo");
sstring sub("");
foo = foo / sub;
produces a correct path of "/foo/", but the trailing slash breaks the
aforementioned assumption about prefixes comparison. As a result, when
an sstable moves between, say, staging and normal locations it may gain
a trailing slash breaking the deletion log creation code.
The fix is to restrict the deletion log creation not to rely on path
string comparison at all, and to trim the trailing slash if it appears.
A test is included.
fixes: #13085
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13090
The compilation of wasm UDFs is performed by a call to a foreign
function, which cannot be divided with yielding points and, as a
result, causes long reactor stalls for big UDFs.
We avoid them by submitting the compilation task to a non-seastar
std::thread, and retrieving the result using seastar::alien.
The thread is created at the start of the program. It executes
tasks from a queue in an infinite loop.
All seastar shards reference the thread through a std::shared_ptr
to a `alien_thread_runner`.
Considering that the compilation takes a long time anyway, the
alien_thread_runner is implemented with focus on simplicity more
than on performance. The tasks are stored in an std::queue, reading
and writing to it is synchronized using an std::mutex for reading/
writing to the queue, and an std::condition_variable waiting until
the queue has elements.
When the destructor of the alien runner is called, an std::nullopt
sentinel is pushed to the queue, and after all remaining tasks are
finished and the sentinel is read, the thread finishes.
Fixes #12904
Closes #13051
* github.com:scylladb/scylladb:
wasm: move compilation to an alien thread
wasm: convert compilation to a future
These two are static binaries, so there's no need to yum/apt-install them with dependencies.
Just download with curl and put them into /usr/local/bin with the X-bit set.
This is needed for future object-storage work in order to run unit tests against minio.
refs: #12523
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
[avi: regenerate frozen toolchain]
Closes #13064
Closes #13099
Adds two test cases which test what happens when we perform an LWT UPDATE, but the partition/clustering key has 0 possible values. This can happen e.g. when a column is supposed to be equal to two different values (`c = 0 AND c = 1`).
Empty partition ranges work properly, empty clustering range currently causes a crash (#13129).
I added tests for both of these cases.
Closes #13130
* github.com:scylladb/scylladb:
cql-pytest/test_lwt: test LWT update with empty clustering range
cql-pytest/test_lwt: test LWT update with empty partition range
CQL supports type casting using C-style casts.
For example it's possible to do: `blob_column = (blob)funcReturningInt()`
This functionality is pretty limited, we only allow such casts between types that have a compatible binary representation. Compatible means that the bytes will stay unchanged after the conversion.
This means that it's legal to cast an int to blob (an int is just a 4-byte blob), but it's illegal to cast a bigint to int (that would change 8 bytes -> 4 bytes).
This simplifies things, to cast we can just reinterpret the value as the other type.
Another use of C-style casts are type hints. Sometimes it's impossible to infer the exact type of an expression from the context. In such cases the type can be specified by casting the expression to this type.
For example: `overloadedFunction((int)?)`
Without the cast it would be impossible to guess what should be the bind marker's type. The function is overloaded, so there are many possible argument types. The type hint specifies that the bind marker has type int.
An interesting thing is that such casts don't have to be explicit. CQL allows putting an int value in a place where a blob value is expected, and it will be converted automatically without any explicit cast.
---
I started looking at our implementation of casts because of #12900. In there the author expressed the need to specify a type hint for bind marker used to pass the WASM code. It could be either `(text)?` for text WASM, or `(blob)?` for binary WASM. This specific use of type hints wasn't supported because there was no `receiver` and the implementation of `prepare_expression` didn't handle that. Preparing casts without a receiver should be easy to implement - we can infer the type of the expression by looking at the type to which the expression is cast.
But while reading `prepare_expression` for `expr::cast` I noticed that the code there is a bit strange. The implementation prepared the expression to cast using the original `receiver` instead of a receiver with the cast type. This caused some issues because of which casting didn't work as expected.
For example it was possible to do:
```cql
blob_column = (blob)funcReturningInt()
```
But this didn't work at all:
```cql
blob_column = (blob)(int)12323
```
It tried to prepare `untyped_constant(12323)` with a `blob` receiver, which fails.
This makes `expr::cast` useless for casting. Casting when the representation is compatible is already implicit. I couldn't find a single case where adding a cast would change the behavior in any way.
There was some use for it as a type hint to choose a specific overload of a function, but it was worthless for casting.
Cassandra has the same issue; I created a `cql-pytest` test and it showed that we behave in the same way as Cassandra does.
I decided to improve this. By preparing the expression using a receiver with the cast type, `expr::cast` becomes actually useful for casting values. Things like `(blob)(int)12323` now work without any issues.
This diverges from the behavior in Cassandra, but it's an extension, not a breaking incompatibility.
---
This PR improves `prepare_expression` for `expr::cast` in the following ways:
1) Support for more complex casts by preparing the expression using a different receiver. This makes casts like `(blob)(int)123` possible
2) Support preparing `expr::cast` without a receiver. Type inference chooses the cast type as the type of the expression.
3) Add pytest tests for C-style casts
`2)` is needed for #12900; the other changes are just something I decided to do since I was already working on this piece of code.
Closes #13053
* github.com:scylladb/scylladb:
expr_test: more tests for preparing bind variables with type hints
prepare_expr: implement preparing expr::cast with no receiver
prepare_expr: use :user formatting in cast_prepare_expression
prepare_expr: remove std::get<> in cast_prepare_expression
prepare_expr: improve cast_prepare_expression
prepare_expr: improve readability in cast_prepare_expression
cql-pytest: test expr::cast in test_cast.py
This series aims to allow users to set permissions on user-defined functions.
The implementation is based on Cassandra's documentation and should be fully compatible: https://cassandra.apache.org/doc/latest/cassandra/cql/security.html#cql-permissions
Fixes: #5572
Fixes: #10633
Closes #12869
* github.com:scylladb/scylladb:
cql3: allow UDTs in permissions on UDFs
cql3: add type_parser::parse() method taking user_types_metadata
schema_change_test: stop using non-existent keyspace
cql3: fix parameter names in function resource constructors
cql3: handle complex types when decoding function permissions
cql3: enforce permissions for ALTER FUNCTION
cql-pytest: add a (failing) test case for UDT in UDF
cql-pytest: add a test case for user-defined aggregate permissions
cql-pytest: add tests for function permissions
cql3: enforce permissions on function calls
selection: add a getter for used functions
abstract_function_selector: expose underlying function
cql3: enforce permissions on DROP FUNCTION
cql3: enforce permissions for CREATE FUNCTION
client_state: add functions for checking function permissions
cql-pytest: add a case for serializing function permissions
cql3: allow specifying function permissions in CQL
auth: add functions_resource to resources
It was suggested as candidate from one of previous reviews, so here it is.
Closes#13140
* github.com:scylladb/scylladb:
distributed_loader: Indentation fix after previous patch
distributed_loader: Coroutinize reshape() helper
Many sstable test cases create tempdir on their own to create sstables with. Sometimes it's justified when the test needs to check files on disk by hand for some validation, but often all checks are fs-agnostic. The latter case(s) can be patched to work on top of any storage, in particular -- on top of object storage. To make it work tests should stop creating sstables explicitly in tempdir and this PR does exactly that.
All relevant occurrences of tempdir are removed from test cases, instead the sstable::test_env's tempdir is used. Next, the test_env::{create_sstable|reusable_sst} are patched not to accept the `fs::path dir` argument and pick the env's tempdir. Finally, the `make_sstable_easy` helper is patched to use path-less env methods too.
refs: #13015
Closes#13116
* github.com:scylladb/scylladb:
test,sstables: Remove path from make_sstable_easy()
test,lib: Remove wrapper over reusable_sst and move the comment
test: Make "compact" test case use env dir
test,compaction: Use env tempdir in some more cases
test,compaction: Make check_compacted_sstables() use env's dir
test: Relax making sstable with sequential generation
test/sstable::test_env: Keep track of auto-incrementing generation
test/lib: Add sstable maker helper without factory
test: Remove last occurrence of test_env::do_with(rval, ...)
test,sstables: Don't mess with tempdir where possible
test/sstable::test_env: Add dir-less sstables making helpers
test,sstables: Use sstables::test_env's tempdir with sweeper
test,sstables: Use sstables::test_env's tempdir
test/lib: Add tempdir sweeper
test/lib: Open-code make_sstable_easy into make_sstable
test: Remove vector of mutation interposer from test_key_count_estimation
Add type constraints for
`sstable_directory::parallel_for_each_restricted()`, enforcing that the
supplied function is invocable with an argument of the specified type.
This helps prevent the problem of passing a function which accepts
`pair<key, value>` or `tuple<key, value>`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`parallel_for_each_restricted()` maps the elements of the given
container with the specified function. In this case, the elements are
of type `unordered_map::value_type`, which is a `pair<const Key, Value>`.
To convert each of them to a `tuple<Key, Value>`, the tuple constructor
is called, but all we intend to do here is access the second element
of the `pair<>`.
In this change, the function's signature is changed to match
`scan_descriptors_map::value_type` to avoid the unnecessary overhead of
the `tuple<>` constructor. Also, the underlying
`max_concurrent_for_each()` does not pass an xvalue to the given func;
instead, it just passes `*s.begin` to the function, where `s.begin` is
an `Iterator` returned by `std::begin(container)`. So let's just use
a plain reference as the parameter type for the function.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
- sstables: mark param of sstable::*_from_sstring() const
- sstables: mark param of reverse_map() const
- sstables: mark static lookup table const
Closes#13115
* github.com:scylladb/scylladb:
sstables: mark static lookup table const
sstables: mark param of reverse_map() const
sstables: mark param of sstable::*_from_sstring() const
This reverts commit 49e0d0402d, reversing
changes made to 25cf325674.
An old version of PR #13115 was accidentally merged into `master` (it
was dequeued concurrently while a running next promotion job included
it).
Revert the merge. We'll merge the new version as a follow-up.
Major compaction can be started from both the storage_service and
column_family APIs. The former allows compacting a subset of tables in a
given keyspace, while the latter compacts a given table in a given keyspace.
As major compaction started from storage_service has a wider scope,
we use its mechanisms for column_family's one. That makes it more consistent
and reduces the number of classes that would be needed to cover major
compaction with task manager's tasks.
I've no idea why the quotes are there at all; it works even without
them. However, with quotes gdb-13 fails to find the _all_threads static
thread-local variable _unless_ it's printed with the gdb "p" command
beforehand.
fixes: #13125
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13132
Drop do_with(), keep the needed variable on stack.
Replace repeat() with plain loop + yield.
Keep track of run_custom_job()'s exception.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, when preparing an authorization statement on a specific
function, we're trying to "prepare" all cql types that appear in
the function signature while parsing the statement. We cannot
do that for UDTs, because we don't know the UDTs that are present
in the database at parsing time. As a result, such authorization
statements fail.
To work around this problem, we postpone the "preparation" of cql
types until the actual statement validation and execution time.
Until then, we store all type strings in the resource object.
The "preparation" happens in the `maybe_correct_resource` method,
which is called before every `execute` during a `check_access` call.
At that point, we have access to the `query_processor`, and as a
result, to `user_types_metadata` which allows us to prepare the
argument types even for UDTs.
In a future patch, we don't have access to a `user_types_storage`
while we want to parse a type, but we do have access to a
`user_types_metadata`, which is enough to parse the type.
We add a variant of the `type_parser::parse()` that takes
a `user_types_metadata` instead of a `user_types_storage` to be
able to parse a type also in the described context.
The current implementation of CQL type parsing worked even
when given a string representing a non-existent keyspace, as
long as the parsed type was one of the "native" types. This
implementation is going to change, so that we won't parse
types given an incorrect keyspace name.
When using `do_with_cql_env`, a "ks" keyspace is created by
default, and "tests" keyspace is not. The tests for reverse
schemas in `schema_change_test` were using the "tests"
keyspace, so in order to make the tests work after the future
changes, they now use the existing "ks" keyspace.
In some places, the parameter name used when constructing
a resource object was 'function_name', while the actual
argument was the signature of a function, which is particularly
confusing, because function names also appear frequently in these
contexts. This patch changes the identifiers to more accurately
reflect what they represent.
Currently, we're parsing types that appear in a function resource
using abstract_type::parse_type, which only works with simple types.
This patch changes it to db::marshal::type_parser::parse, which
can also handle collections.
We also adjust the test_grant_revoke_udf_permissions test so that
it uses both simple and complex types as parameters of the function
that we're granting/revoking permissions on.
Currently, the ALTER permission is only enforced on ALL FUNCTIONS
or on ALL FUNCTIONS IN KEYSPACE.
This patch enforces the permission also on a specific function.
Our permissions system is currently incapable of figuring out
user-defined type definitions when preparing functions permissions.
This test case creates such a function, and it passes on Cassandra.
these tables are mappings from symbolic names to their string
representation. we don't mutate them. so mark them const.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it does not mutate the map in which the value is looked up, so let's
mark map const. also, take this opportunity to use structured binding
for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The components_writer class from this list doesn't even exist
Also drop the forward declaration of mx::partition_reversing_data_source_impl
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13097
Add tests for preparing expr::cast which contains a bind variable,
with a known receiver.
expr::cast serves as a type hint for the bind variable.
It specifies what should be the type of the bind variable,
we must check that this type is compatible with the receiver
and fail in case it isn't.
The following cases are tested:
Valid:
`text_col = (text)?`
`int_col = (int)?`
Invalid:
`text_col = (int)?`
`int_col = (text)?`
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Type inference in cast_prepare_expression was very limited.
Without a receiver it just gave up and said that it can't
infer the type.
It's possible to infer the type - an expression that
casts something to type bigint also has type bigint.
This can be implemented by creating a fake receiver
when the caller didn't specify one.
Type of this fake receiver will be c.type
and c.arg will be prepared using this receiver.
Note that the previous change (changing receiver
to cast_type_receiver in prepare_expression) is required
to keep the behaviour consistent.
Without it we would sometimes prepare c.arg using the
original receiver, and sometimes using a receiver
with type c.type.
Currently it's impossible to test this change
on live code. Every place that uses expr::cast
specifies a receiver.
A unit test is all that can be done at the moment
to ensure correctness.
In the future this functionality will be used in UDFs.
In https://github.com/scylladb/scylladb/pull/12900
it was requested to be able to use a type hint
to specify whether WASM code of the function
will be sent in binary or text form.
The user can convey this by typing
either `(blob)?` or `(text)?`.
In this case there will be no receiver
and type inference would fail.
After this change it will work - it's now possible
to prepare either of those and get an expression
with a known type.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
By default expressions are printed using the {:debug} formatting,
which is intended for internal use. Error messages should use the
{:user} formatting instead.
cast_prepare_expression uses the default formatting in a few places
that are user facing, so let's change it to use {:user} formatting.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A few times throughout cast_prepare_expression there's
a line which uses std::get<> to get the raw type of the cast.
`std::get<shared_ptr<cql3_type::raw>>(c.type)`
This is a dangerous thing to do. It might turn out that the variant
holds a different alternative and then it'll start throwing bad_variant_access.
In this case this would happen if someone called cast_prepare_expression
on an expression that is already prepared.
It's possible to modify the code in a way that avoids doing the std::get
altogether.
It makes the code more resilient and gives me peace of mind.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
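A small illustration of the safer pattern, with hypothetical stand-in types (the real variant holds a `shared_ptr<cql3_type::raw>` alternative): probe with `std::get_if` instead of calling `std::get<>` unconditionally, so an already-prepared cast doesn't throw `bad_variant_access`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <variant>

// Hypothetical stand-ins for the raw and prepared alternatives of the
// cast's type field; names are illustrative, not Scylla's.
struct raw_type { std::string name; };
struct prepared_type { std::string name; };

using cast_type = std::variant<std::shared_ptr<raw_type>, prepared_type>;

// Handle both alternatives explicitly instead of assuming the raw one.
std::string cast_type_name(const cast_type& t) {
    if (auto raw = std::get_if<std::shared_ptr<raw_type>>(&t)) {
        return (*raw)->name;  // still unprepared: use the raw spelling
    }
    return std::get<prepared_type>(t).name;  // already prepared
}
```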
Preparing expr::cast had some artificial limitations.
Things like this worked:
`blob_col = (blob)funcReturnsInt()`
But this didn't:
`blob_col = (blob)(int)1234`
This is caused by the line:
`prepare_expression(c.arg, db, keyspace, schema_opt, receiver)`
Here the code prepares the expression to be cast using the original
receiver which was passed to cast_prepare_expression.
In the example above this meant that it tried to prepare
untyped_constant(1234) using a receiver with type blob.
This failed because an integer literal is invalid for a blob column.
To me it looks like a mistake. What it should do instead
is prepare the int literal using the type (int) and then
see if int can be cast to blob, by checking if these types
have compatible binary representation.
This can be achieved by using `cast_type_receiver` instead of `receiver`.
Making this small change makes it possible to use the cast
in many situations where it was previously impossible.
The tests have to be updated to reflect the change,
some of them now deviate from Cassandra, so they have
to be marked scylla_only.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The function allows extracting used function definitions
from given selection. Thanks to that, it will be possible
to verify if the callee has proper permissions to execute
given functions.
This test case checks that granting function permissions
results in correct serialization of the permissions - so that
reading system_auth.role_permissions and listing the permissions
via CQL with `LIST permission OF role` works in a compatible way
with both Scylla and Cassandra.
This commit allows users to specify the following resources:
- ALL FUNCTIONS
- ALL FUNCTIONS IN KEYSPACE ks
- FUNCTION f(int, double)
The permissions set for these resources are not enforced yet.
This commit adds "functions" resource to our authorization
resources. The implementation strives to be compatible
with Cassandra both from CQL level and serialization,
i.e. so that entries in system_auth.role_permissions table
will be identical if CassandraAuthorizer is used.
This commit adds a way of representing these resources
in-memory, but they are not enforced as permissions yet.
The following permissions are supported:
```
CREATE ALL FUNCTIONS
CREATE ALL FUNCTIONS IN KEYSPACE <ks>
ALTER ALL FUNCTIONS
ALTER ALL FUNCTIONS IN KEYSPACE <ks>
ALTER FUNCTION <f>
DROP ALL FUNCTIONS
DROP ALL FUNCTIONS IN KEYSPACE <ks>
DROP FUNCTION <f>
AUTHORIZE ALL FUNCTIONS
AUTHORIZE ALL FUNCTIONS IN KEYSPACE <ks>
AUTHORIZE FUNCTION <f>
EXECUTE ALL FUNCTIONS
EXECUTE ALL FUNCTIONS IN KEYSPACE <ks>
EXECUTE FUNCTION <f>
```
as per
https://cassandra.apache.org/doc/latest/cassandra/cql/security.html#cql-permissions
Add a test case which performs an LWT UPDATE, but the partition key
has 0 possible values, because it's supposed to be equal to two
different values.
Such queries used to cause problems in the past.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Related: https://github.com/scylladb/scylladb/issues/13119
This commit removes the pages that describe Enterprise only features
from the Open Source documentation:
- Encryption at Rest
- Workload Prioritization
- LDAP Authorization
- LDAP Authentication
- Audit
In addition, it removes most of the information about Incremental
Compaction Strategy (ICS), which is replaced with links to the
Enterprise documentation.
The changes above required additional updates introduced with this
commit:
- The links to Enterprise-only features are replaced with the
corresponding links in the Enterprise documentation.
- The redirections are added for the removed pages to be redirected to
the corresponding pages in the Enterprise documentation.
This commit must be reverted in the scylla-enterprise repository to
avoid deleting the Enterprise-only content from the Enterprise docs.
Closes#13123
Used while the permit is in the _ready_list, waiting for the execution
loop to pick it up. This just acknowledges the existence of this
wait-state. This state will now show up in permit diagnostics printouts,
and we can now determine whether a permit is waiting for execution
without checking which queue it is in.
Instead of using expiring_fifo to store queued permits, use the same
intrusive list mechanism we use to keep track of all permits.
Permits are now moved between the _permit_list and the wait queues,
depending on which state they are in. This means _permit_list is now not
the definitive list containing all permits, instead it is the list
containing all permits that are not in a more specialized queue at the
moment.
Code wishing to iterate over all permits should now use
foreach_permits(). For outside code, this was already the only way and
internal users are already patched.
Making the wait lists intrusive allows us to dequeue a permit from any
position, with nothing but a permit reference at hand. It also means
the wait queues don't have any additional memory requirements, other
than the memory for the permit itself.
Timeout while being queued is now handled by the permit's on_timeout()
callback.
Use the node_ops_ctl methods for the basic
flow of: start, start_heartbeat_updater, prepare,
send_to_all, done|abort
As well for querying pending ops for decommission.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of the `entry` wrapper. In _wait_list and _ready_list, that is.
Data stored in the `entry` wrapper is moved to a new
`reader_permit::auxiliary_data` type. This makes the reader permit
self-sufficient. This in turn prepares the ground for the ability to
de-queue a permit from any queue, with nothing but a permit reference at
hand: no need to have back pointer to wrappers and/or iterators.
Currently the reader_permit has some private methods that only the
semaphore's internals call. But this method of communication is not
consistent; other times the semaphore accesses the permit impl directly,
calling methods on that.
This commit introduces operator * and -> for reader_permit. With this,
the semaphore internals always call the reader_permit::impl methods
directly, either via a direct reference, or via the above operators.
This makes the permit interface a little narrower and reduces
boilerplate code.
Use it to keep track of all permits that are currently waiting on
something: admission, memory or execution.
Currently we keep track of the size by adding up the result of size() of
the various queues. In future patches we are going to change the queues
such that they will not have constant-time size anymore, so move to an
explicit counter in preparation for that.
Another change this commit makes is to also include ready list entries
in this counter. Permits in the ready list are also waiters, they wait
to be executed. Soon we will have a separate wait state for this too.
Instead of having callers use get_timeout(), then compare it against the
current time, set up a timeout timer in the permit, which assigns a new
`_ex` member (a `std::exception_ptr`) with the appropriate exception type
when it fires.
Callers can now just poll check_abort() which will throw when `_ex`
is not null. This is more natural and allows for more general reasons
for aborting reads in the future.
This prepares the ground for timeouts being managed inside the permit,
instead of by the semaphore. Including timing out while in a wait queue.
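The polling scheme above can be sketched with an illustrative `permit_sketch` class (not Scylla's actual `reader_permit`): an abort source stores an exception into the permit, and readers call `check_abort()` at their preemption points, which rethrows once the exception is set.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical sketch: the permit holds a std::exception_ptr that a
// timeout timer (or any other abort source) fills in asynchronously.
class permit_sketch {
    std::exception_ptr _ex;
public:
    // Called by the timer / abort source with the reason for aborting.
    void abort(std::exception_ptr ex) { _ex = std::move(ex); }

    // Polled by readers; throws the stored exception once one is set.
    void check_abort() const {
        if (_ex) {
            std::rethrow_exception(_ex);
        }
    }
};
```

Because the abort reason is an arbitrary `exception_ptr`, the same mechanism can later carry reasons other than timeouts.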
All node operations we currently support go through
similar basic flow and may add some op-specific logic
around it.
1. Select the nodes to sync with (this is op specific).
2. heartbeat updater
3. send prepare req
4. perform the body of the node operation
5. send done
--
on any error: send abort
node_ops_ctl formalizes all those steps and makes
sure errors are handled in all steps, and
the error causing abort is not masked by errors
in the abort processing, and is propagated upstream.
Some of the printouts repeat the node operation description
to remain backward compatible so not to break dtests
that wait for them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The compilation of wasm UDFs is performed by a call to a foreign
function, which cannot be divided with yielding points and, as a
result, causes long reactor stalls for big UDFs.
We avoid them by submitting the compilation task to a non-seastar
std::thread, and retrieving the result using seastar::alien.
The thread is created at the start of the program. It executes
tasks from a queue in an infinite loop.
All seastar shards reference the thread through a std::shared_ptr
to a `alien_thread_runner`.
Considering that the compilation takes a long time anyway, the
alien_thread_runner is implemented with focus on simplicity more
than on performance. The tasks are stored in a std::queue; access to it
is synchronized using a std::mutex, and a std::condition_variable is
used to wait until the queue has elements.
When the destructor of the alien runner is called, an std::nullopt
sentinel is pushed to the queue, and after all remaining tasks are
finished and the sentinel is read, the thread finishes.
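The queue/mutex/condition-variable scheme can be sketched as follows; this is an illustrative stand-in (result delivery via seastar::alien and the shared_ptr sharing across shards are omitted), not Scylla's actual class.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Minimal sketch of a runner executing tasks on a non-seastar thread.
class alien_thread_runner {
    std::queue<std::optional<std::function<void()>>> _tasks;
    std::mutex _mtx;
    std::condition_variable _cv;
    std::thread _worker;
public:
    alien_thread_runner() : _worker([this] {
        for (;;) {
            std::optional<std::function<void()>> task;
            {
                std::unique_lock lk(_mtx);
                _cv.wait(lk, [this] { return !_tasks.empty(); });
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            if (!task) {  // nullopt sentinel: all earlier tasks are done
                return;
            }
            (*task)();    // run the long task off the reactor
        }
    }) {}

    void submit(std::function<void()> task) {
        { std::lock_guard lk(_mtx); _tasks.push(std::move(task)); }
        _cv.notify_one();
    }

    ~alien_thread_runner() {
        // Push the sentinel; the worker drains remaining tasks first.
        { std::lock_guard lk(_mtx); _tasks.push(std::nullopt); }
        _cv.notify_one();
        _worker.join();
    }
};
```

The destructor's `join()` guarantees that all submitted tasks have completed before the runner is destroyed.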
This param is from a time when _permit_list was not accessible from the
outside, so it was passed along the semaphore instance to avoid making
the diagnostics methods friends.
To allow the semaphore freedom in how permits are stored, the
diagnostics code is instead made to use foreach_permit(), instead of
accessing the underlying list directly.
As the diagnostics code wants reader_permit::impl& directly, a new
variant of foreach_permit() passing impl references is introduced.
It already is conceptually, as it passes const references to the permits
it iterates over. The only reason it wasn't const before is a technical
issue which is solved here with a const_cast.
Currently failing to abort a node operation will
throw and mask the original failure handled in the catch block.
See #12333 for example.
Fixes#12798
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
- sstables: remove unused function
- sstables: mark param of sstable::*_from_sstring() const
- sstables: mark param of reverse_map() const
- sstables: mark static lookup table const
Closes#13115
* github.com:scylladb/scylladb:
sstables: mark static lookup table const
sstables: mark param of reverse_map() const
sstables: mark param of sstable::*_from_sstring() const
sstables: remove unused function
The method in question is only called with env's tempdir, so there's no
point in explicitly passing it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a wonderful comment describing what the reusable_sst is for near
one of its wrappers. It's better to drop the wrapper and move the
comment to where it belongs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both already do so, but get the tempdir explicitly. It's possible to
make them much shorter by not carrying this variable over the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's in fact using it already via argument. Next patch will do the same
with another call, but having this change separately makes the next
patch shorter and easier to review.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Many test cases populate sstables with a factory that at the same time
serves as a stable maintainer of a monotonic generation. Those can be
greatly relaxed by re-using the recently introduced generation from the
test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Lots of test cases make sstables with monotonically incrementing
generation values. In Scylla code this counter is maintained in class
table, but sstable tests do not always have it. To mimic this behavior, the
test_env can keep track of the generation, so that callers just don't
mess with it (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a make_sstable_containing() helper that creates an sstable and
populates it with mutations (and does some post-validation). The helper
accepts a factory function that should make the sstable for it.
This patch shuffles this helper a bit by introducing an overload that
populates (and validates) the already existing sstable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's the lonely test case that uses the mentioned template to carry
its own instance of tempdir over its lifetime. Patch the case to re-use
the already existing env's tempdir and drop the template.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Beneficiary of the previous patch -- those cases that make sstables in
env's tempdir can now enjoy not mentioning this explicitly, letting
the env specify the sstable making path itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Lots of (most of) test cases out there generate sstables inside env's
temporary directory. This patch adds some sugar to env that will allow
test cases to omit the explicit env.tempdir() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch. Some test cases are sensitive to
having the temp directory clean, so patch them similarly, but equip them
with the sweeper on entry instead of their own tempdir instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The env maintains its tempdir throughout its lifetime. For many test
cases there's no point in generating a tempdir of their own, so just
switch to using the env's one.
The code gets longer lines, but this is going to change really soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a RAII-sh helper that cleans temp directory on destruction. To
be used in cases when a test needs to do several checks over clean
temporary directory (future patches).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
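A hedged sketch of such a RAII sweeper (illustrative name and semantics, not the actual test/lib helper): on destruction it removes the directory's contents while keeping the directory itself, so the next check starts from a clean tempdir.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical RAII sweeper: cleans the given directory's contents when
// it goes out of scope, leaving the directory in place for reuse.
class tmpdir_sweeper {
    fs::path _dir;
public:
    explicit tmpdir_sweeper(fs::path dir) : _dir(std::move(dir)) {}
    ~tmpdir_sweeper() {
        for (auto& e : fs::directory_iterator(_dir)) {
            fs::remove_all(e.path());
        }
    }
};
```

A test can then open a fresh `tmpdir_sweeper` scope per check instead of creating a new tempdir each time.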
The former helper is going to get rid of the fs::path& dir argument,
but the latter cannot yet live without it. The simplest solution is to
open-code the helper until better times.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test generates a vector of mutations to be later passed into the
make_sstable() helper, which just applies them to a memtable. The test case
can generate the memtable directly. This makes it possible to stop using the
local tempdir in this test case in future patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
these tables are mappings from symbolic names to their string
representation. we don't mutate them. so mark them const.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it does not mutate the map in which the value is looked up, so let's
mark map const. also, take this opportunity to use structured binding
for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- api: reference httpd::* symbols like 'httpd::*'
- alternator: using chrono_literals before using it
- api: s/request/http::request/
the last two commits were inspired by Pavel's comment of
> It looks like api/ code was caught by some using namespace seastar::httpd shortcut.
they should be landed before we merge and include https://github.com/scylladb/seastar/pull/1536 in Scylla.
Closes#13095
* github.com:scylladb/scylladb:
api: reference httpd::* symbols like 'httpd::*'
alternator: using chrono_literals before using it
api: s/request/http::request/
- distributed_loader: print log without using fmt::format()
- distributed_loader: correct a typo in comment
Closes#13108
* github.com:scylladb/scylladb:
distributed_loader: correct a typo in comment
distributed_loader: print log without using fmt::format()
these dependencies were found when trying to compile
`user_function_test`. whenever a library libfoo references another one,
say, libbar, the corresponding linkage from libfoo to libbar is added.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
It was an oversight in 11124ee972,
which added a test not yet included in master HEAD. So let's
drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Our documentation states that writing an item with "USING TTL 0" means it
should never expire. This should be true even if the table has a default
TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no
USING TTL at all (i.e., it took the default TTL, instead of unlimited).
We had two xfailing tests demonstrating that Scylla's behavior in this
is different from Cassandra. Scylla's behavior in this case was also
undocumented.
By the way, Cassandra used to have the same bug (CASSANDRA-11207) but
it was fixed already in 2016 (Cassandra 3.6).
So in this patch we fix Scylla's "USING TTL 0" behavior to match the
documentation and Cassandra's behavior since 2016. One xfailing test
starts to pass, and the second test gets past this bug and fails on a
different one. This patch also adds a third test for "USING TTL ?"
with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a
missing "USING TTL".
The origin of this bug was that after parsing the statement, we saved
the USING TTL in an integer, and used 0 for the case of no USING TTL
given. This meant that we couldn't tell if we have USING TTL 0 or
no USING TTL at all. This patch uses an std::optional so we can tell
the case of a missing USING TTL from the case of USING TTL 0.
Fixes#6447
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13079
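The std::optional-based distinction described above can be sketched like this (a hypothetical helper, not Scylla's actual code): `nullopt` means no USING TTL was given and the table default applies, while an explicit value, including 0, is taken as-is.

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch of the fixed logic: "no USING TTL" (nullopt) falls
// back to the table's default TTL, while "USING TTL 0" stays 0, meaning
// the item never expires.
int effective_ttl(std::optional<int> statement_ttl, int table_default_ttl) {
    if (!statement_ttl) {
        return table_default_ttl;  // missing USING TTL: use the default
    }
    return *statement_ttl;         // explicit TTL, including 0 (no expiry)
}
```

With a plain `int` initialized to 0, the first and second cases below would be indistinguishable, which was exactly the bug.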
logger.info() is able to format the given arguments with the format
string, so let's just let it do its job.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Move replication logic for live endpoint across shards to a separate
method
This will be used by API get alive nodes.
As this is now in a method and outside gossiper::run(), assert it's
called from shard 0.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In issue #5283 we noted that the auto_snapshot option is not useful
in Alternator (as we don't offer any API to restore the snapshot...),
and suggested that we should automatically disable this option for
Alternator tables. However, this issue has been open for more than three
years, and we never changed this default.
So until we solve that issue - if we ever do - let's add a paragraph
in docs/alternator/alternator.md recommending to the user to disable
this option in the configuration themselves. The text explains why,
and also provides a link to the issue.
Refs #5283
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13103
Add a test which performs an UPDATE and
tries to pass an UNSET_VALUE as a value
for the primary key.
There is also an LWT variant of this test
that tries to set an UNSET_VALUE
in the IF condition.
These two tests are analogous to
test_insert_update_where and
test_insert_update_where_lwt,
but use an UPDATE instead of INSERT.
It's useful to test UPDATE as well as INSERT.
When I was developing a fix for #13001
I initially added the condition for unset value
inside insert_statement, but this didn't handle
update statements. These two tests allowed me
to see that UPDATE still causes a crash.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13058
Undefined behavior because the evaluation order is undefined.
With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().
The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for row, which access schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.
Fixes#13093.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13092
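The hazard class and its fix can be illustrated with a hypothetical stand-in (the real call site involves make_flat_mutation_reader_from_mutations_v2() and a permit): when one argument reads from `s` and another is `std::move(s)`, the evaluation order of the two is unspecified, so the read may see a moved-from object. Sequencing the read through a named local removes the dependence on the compiler.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Illustrative stand-in for the schema object.
struct schema { std::string name; };

// Stand-in callee taking both a value derived from the schema and
// ownership of the schema itself.
size_t consume(std::string name_snapshot, std::shared_ptr<schema> s) {
    return name_snapshot.size() + (s ? 1 : 0);
}

size_t safe_call(std::shared_ptr<schema> s) {
    // BAD (unspecified order): consume(s->name, std::move(s));
    // Fix: evaluate the read into a local first, then move.
    auto name = s->name;
    return consume(std::move(name), std::move(s));
}
```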
When Fedora 37 came out, we discovered that its "pytest" script started
to run Python with the "-s" option, which caused problems for packages
installed personally via pip. We fixed this by adding our own wrapper
script test/pytest.
But this bug (https://bugzilla.redhat.com/show_bug.cgi?id=2152171) was
already fixed in Fedora 37, and the new version already reached our
dbuild. So we no longer need this wrapper script. Let's remove it.
Fixes#12412
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13083
cast_prepare_expression takes care of preparing expr::cast,
which is responsible for CQL C-style casts.
At first glance it can be hard to figure out what exactly
it does, so I added some comments to make things clearer.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
CQL supports C-style casts with the destination type specified
inside parentheses, e.g. `blob_column = (blob)funcThatReturnsInt()`.
These casts can be used to convert values of types
that have compatible binary representation, or as a type hint
to specify the type where the situation is ambiguous.
I didn't find any cql-pytest tests for this feature,
so I added some.
It looks like the feature works, but only partially.
Doing things like this works:
`blob_column = (blob)funcThatReturnsInt()`
But trying to do something a bit more complex fails:
`blob_column = (blob)(int)1234`
This is the case in both Cassandra and Scylla;
the tests introduced in this commit pass on both of them.
In future commits I will extend this feature
to support the more complex cases as well,
then some tests will have to be marked scylla_only.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We have seen users unintentionally use RF=1 or RF=2 for a keyspace.
We would like to have an option for a minimal RF that is allowed.
Cassandra recently added, in Cassandra 4.1 (see apache/cassandra@5fdadb2
and https://issues.apache.org/jira/browse/CASSANDRA-14557), exactly such
an option, called "minimum_keyspace_rf" - so we chose to use the same option
name in Scylla too. This means that unlike the previous "safe mode"
options, the name of this option doesn't start with "restrict_".
The value of the minimum_keyspace_rf option is a number, and lower
replication factors are rejected with an error like:
cqlsh> CREATE KEYSPACE x WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor': 2 };
ConfigurationException: Replication factor replication_factor=2 is
forbidden by the current configuration setting of minimum_keyspace_rf=3.
Please increase replication factor, or lower minimum_keyspace_rf set in
the configuration.
This restriction applies to both CREATE KEYSPACE and ALTER KEYSPACE
operations. It applies to both SimpleStrategy and NetworkTopologyStrategy,
for all DCs or a specific DC. However, a replication factor of zero (0)
is *not* forbidden - this is the way to explicitly request not to
replicate (at all, or in a specific DC).
For the time being, minimum_keyspace_rf=0 is still the default, which
means that any replication factor is allowed, as before. We can easily
change this default in a followup patch.
Note that in the current implementation, trying to use RF below
minimum_keyspace_rf is always an error - we don't have a syntax
to make it just a warning. In any case, the error message explains
exactly which configuration option is responsible for this restriction.
Fixes#8891.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#9830
This patch fixes a problem affecting decommission and removenode
which may lead to data consistency problems when one of the nodes
unilaterally decides to abort the node operation without the
coordinator noticing.
If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later detect that
one of the nodes aborted during the sending of the decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.
It's unsafe to finalize the operation when another node has aborted. The
other nodes reverted to the old topology, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have failed to replicate those writes,
depending on timing.
It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.
There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch changes removenode
to also take this approach by moving sending of remove_done earlier.
We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.
The race can still happen, but the time window is now much smaller.
The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.
Fixes#12989
Refs #12969
Closes#13028
* github.com:scylladb/scylladb:
storage_service: node ops: Extract node_ops_insert() to reduce code duplication
storage_service: Make node operations safer by detecting asymmetric abort
storage_service: node ops: Add error injections
service: node_ops: Make watchdog and heartbeat intervals configurable
when comparing the disabled warnings specified by `configured.py` and the ones specified by `cmake/mode.common.cmake`, it turns out we are now able to enable more warning options. so let's enable them. the change was tested using Clang-17 and GCC-13.
there are many errors from GCC-13, like:
```
/home/kefu/dev/scylladb/db/view/view.hh:114:17: error: declaration of ‘column_kind db::view::clustering_or_static_row::column_kind() const’ changes meaning of ‘column_kind’ [-fpermissive]
114 | column_kind column_kind() const {
| ^~~~~~~~~~~
```
so the build with GCC failed.
and with this change, Clang-17 is able to build the tree without warnings.
Closes#13096
* github.com:scylladb/scylladb:
build: enable more warnings
test: do not initialize plain number with {}
test: do not initialize a time_t with braces
After we move the compilation to an alien thread, the completion
of the compilation will be signaled by fulfilling a seastar promise.
As a result, the `precompile` function will return a future, and
because of that, other functions that use the `precompile` function
will also become futures.
We can do all the necessary adjustments beforehand, so that the actual
patch that moves the compilation will contain fewer irrelevant changes.
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so as to reduce the number of conditional branches (apparently the compiler isn't doing a good enough job of optimizing it in this case)
2. avoiding an sstring copy in topology::get_{datacenter,rack}
Closes#12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
One test in test/cql-pytest/test_batch.py accidentally had the asyncio
marker, despite not using any async features. Remove it. The test still
runs fine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13002
- build: cmake: use different names for output of check_cxx_compiler_flag
- build: cmake: only add supported warning flags to CMAKE_CXX_FLAGS
- build: cmake: limit the number of link job
Closes#13098
* github.com:scylladb/scylladb:
build: cmake: limit the number of link job
build: cmake: only add supported warning flags to CMAKE_CXX_FLAGS
build: cmake: use different names for output of check_cxx_compiler_flag
it turns out we have `using namespace httpd;` in seastar's
`request_parser.rl`, and we should not rely on this statement to
expose the symbols in `seastar::httpd` to the `seastar` namespace.
in this change,
* api/*.hh: all httpd symbols are referenced by `httpd::*`
instead of being referenced as if they are in `seastar`.
* api/*.cc: add `using namespace seastar::httpd`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should not assume that some included header does this for us.
we'd have the following compile failure if seastar's
src/http/request_parser.rl no longer did `using namespace httpd;`:
```
/home/kefu/dev/scylladb/alternator/streams.cc:433:55: error: no matching literal operator for call to 'operator""h' with argument of type 'unsigned long long' or 'const char *', and no matching literal operator template
static constexpr auto dynamodb_streams_max_window = 24h;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch reorganizes and extends CQL related metrics.
Before this patch we only had counters for specific CQL requests.
However, we often need to reason about the size of CQL queries: the
corresponding request and response sizes.
This patch adds corresponding metrics:
- Arranges all 3 per-opcode statistics counters in a single struct.
- Defines a vector of such structs for each CQL opcode.
- Adjusts statistics updates accordingly - the code is much simpler
now.
- Removes old metrics that were accounting some CQL opcodes.
- Adds new per-opcode metrics for requests number, request and response sizes:
- New metrics are of a derived kind - rate() should be applied to them.
- There are 3 new metrics names:
- 'cql_requests_count'
- 'cql_request_bytes'
- 'cql_response_bytes'
- New metrics have a per-opcode label - 'kind'.
For example:
The number of response bytes for the EXECUTE opcode on shard 0 looks as follows:
scylla_transport_cql_response_bytes{kind="EXECUTE",shard="0"}
Ref #13061
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20230302154816.299721-1-vladz@scylladb.com>
when comparing the disabled warnings specified by `configured.py`
and the ones specified by `cmake/mode.common.cmake`, it turns out
we are now able to enable more warning options. so let's enable them.
the change was tested using Clang-17 and GCC-13.
there are many errors from GCC-13, like:
```
/home/kefu/dev/scylladb/db/view/view.hh:114:17: error: declaration of ‘column_kind db::view::clustering_or_static_row::column_kind() const’ changes meaning of ‘column_kind’ [-fpermissive]
114 | column_kind column_kind() const {
| ^~~~~~~~~~~
```
so the build with GCC failed.
and with this change, Clang-17 is able to build the tree without
warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
time_t is defined as an "arithmetic type capable of representing times",
so we can just initialize it with 0 without braces. this change should
silence warnings like:
```
test/boost/aggregate_fcts_test.cc:238:45: error: braces around scalar initializer [-Werror,-Wbraced-scalar-init]
auto tp = db_clock::from_time_t({ 0 }) + std::chrono::milliseconds(1);
^~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The sstable_compaction_test::simple_backlog_controller_test makes
sstables with an empty dir argument. Eventually this means that sstables
end up in the / directory [1], which is not nice.
As a side effect, this also makes sstable::storage::prefix() return an
empty string which, in turn, confuses the code that tries to analyze the
prefix contents (refs: #13090)
[1] See, e.g. logs from https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4757/consoleText
```
INFO 2023-03-06 21:23:04,536 [shard 0] compaction - [Compact ks.cf 51489760-bc54-11ed-a08c-7d3f1d77e2e4] Compacting [/la-1-big-Data.db:level=0:origin=]
```
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13094
we have another solution: mark db_user_types_storage `final`, as we
don't destruct `db_user_types_storage` through a pointer to any of its base
classes. but it'd be much simpler to just mark the dtor of the
first base class which has virtual method(s) as virtual. it's more idiomatic
this way, and less error-prone.
this change should silence the following warning:
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/stl_construct.h:88:2: error: destructor called on non-final 'replica::db_user_types_storage' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
__location->~_Tp();
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/stl_construct.h:149:12: note: in instantiation of function template specialization 'std::destroy_at<replica::db_user_types_storage>' requested here
std::destroy_at(__pointer);
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/alloc_traits.h:674:9: note: in instantiation of function template specialization 'std::_Destroy<replica::db_user_types_storage>' requested here
{ std::_Destroy(__p); }
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/shared_ptr_base.h:613:28: note: in instantiation of function template specialization 'std::allocator_traits<std::allocator<void>>::destroy<replica::db_user_types_storage>' requested here
allocator_traits<_Alloc>::destroy(_M_impl._M_alloc(), _M_ptr());
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/shared_ptr_base.h:599:2: note: in instantiation of member function 'std::_Sp_counted_ptr_inplace<replica::db_user_types_storage, std::allocator<void>, __gnu_cxx::_S_atomic>::_M_dispose' requested here
_Sp_counted_ptr_inplace(_Alloc __a, _Args&&... __args)
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/shared_ptr_base.h:972:6: note: in instantiation of function template specialization 'std::_Sp_counted_ptr_inplace<replica::db_user_types_storage, std::allocator<void>, __gnu_cxx::_S_atomic>::_Sp_counted_ptr_inplace<replica::database &>' requested here
_Sp_cp_type(__a._M_a, std::forward<_Args>(__args)...);
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/shared_ptr_base.h:1712:14: note: in instantiation of function template specialization 'std::__shared_count<>::__shared_count<replica::db_user_types_storage, std::allocator<void>, replica::database &>' requested here
: _M_ptr(), _M_refcount(_M_ptr, __tag, std::forward<_Args>(__args)...)
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/shared_ptr.h:464:4: note: in instantiation of function template specialization 'std::__shared_ptr<replica::db_user_types_storage>::__shared_ptr<std::allocator<void>, replica::database &>' requested here
: __shared_ptr<_Tp>(__tag, std::forward<_Args>(__args)...)
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/shared_ptr.h:1009:14: note: in instantiation of function template specialization 'std::shared_ptr<replica::db_user_types_storage>::shared_ptr<std::allocator<void>, replica::database &>' requested here
return shared_ptr<_Tp>(_Sp_alloc_shared_tag<_Alloc>{__a},
^
/home/kefu/dev/scylladb/replica/database.cc:313:24: note: in instantiation of function template specialization 'std::make_shared<replica::db_user_types_storage, replica::database &>' requested here
, _user_types(std::make_shared<db_user_types_storage>(*this))
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13062
When the WASM UDFs were first introduced, the LANGUAGE required in
the CQL statements to use them was "xwasm", because the ABI for the
UDFs was still not specified and changes to it could be backwards
incompatible.
Now, the ABI is stabilized, but if backwards incompatible changes
are made in the future, we will add a new ABI version for them, so
the name "xwasm" is no longer needed and we can finally
change it to "wasm".
Closes#13089
There are two places that do it -- commitlog and batchlog replayers. Both can have a local system-keyspace reference and use the system-keyspace-local query processor for it. The peering save_truncation_record() is not that simple and is not patched by this PR
Closes#13087
* github.com:scylladb/scylladb:
system_keyspace: Unstatic get_truncation_record()
system_keyspace: Unstatic get_truncated_at()
batchlog_manager: Add system_keyspace dependency
main: Swap batchlog manager and system keyspace starts
system_keyspace: Unstatic get_truncated_position()
system_keyspace: Remove unused method
commitlog: Create commitlog_replayer with system keyspace
test: Make cql_test_env::get_system_keyspace() return sharded
commitlog: Line-up field definitions
docs/alternator/compatibility.md mentions a known problem that
Alternator Streams are divided into too many "shards". This patch
adds a link to a github issue to track our work on this issue - like
we did for most other differences mentioned in compatibility.md.
Refs #13080
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13081
* use the value of disabled_warnings, not the variable name, for warning
options; otherwise we'd be checking options like `-Wno-disabled_warnings`.
* use different names for the output of check_cxx_compiler_flag() calls.
as the output variable of a check_cxx_compiler_flag(..) call is cached,
we cannot reuse it for checking different warning options.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
seastar::httpd::request was deprecated in favor of `seastar::http::request`
since bdd5d929891d2cb821eca25896e25ed4ff658b7a.
so let's use the latter. this change also silences the warning of:
```
/home/kefu/dev/scylladb/api/authorization_cache.cc: In function ‘void api::set_authorization_cache(http_context&, seastar::httpd::routes&, seastar::sharded<auth::service>&)’:
/home/kefu/dev/scylladb/api/authorization_cache.cc:19:104: error: ‘using seastar::httpd::request = struct seastar::http::request’ is deprecated: Use http::request instead [-Werror=deprecated-declarations]
19 | httpd::authorization_cache_json::authorization_cache_reset.set(r, [&auth_service] (std::unique_ptr<request> req) -> future<json::json_return_type> {
| ^~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in several ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)
The list goes on. We already have an evict() method that does all this
correctly; use it instead of the current badly open-coded alternative.
This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.
Fixes: #13048
Closes#13049
in general, the more static analysis the merrier. with the updated
Seastar, which includes the commit "core/sstring: define <=> operator
for sstring", all defaulted '<=>' operators which previously relied
on sstring's operator<=> will no longer be deleted, so we can
enable `-Wdefaulted-function-deleted` now.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12861
this change also includes a change to main, to make this commit compile.
see below:
* seastar 9b6e181e42...9cbc1fe889 (46):
> Merge 'Make io-tester jobs share sched classes' from Pavel Emelyanov
> io_tester.md: Update the `rps` configuration option description
> io_tester: Add option to limit total number of requests sent
> Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov
> io_tester: Add option to share classes between jobs
> rpc: Abort connection if send_entry() fails
> Merge 'build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS' from Kefu Chai
> build: cooking.sh: use the same BUILD_SHARED_LIBS when building ingredients
> build: cooking.sh: use the same generator when building ingredients
> core/memory: handle `strerror_r` returning static string
> Merge 'build, rpc: lz4 related cleanups' from Kefu Chai
> build, rpc: do not support lz4 < 1.7.3
> build: set the correct version when finding lz4
> build: include CheckSymbolExists
> rpc: do not include lz4.h in header
> build: set CMP0135 for Cooking.cmake
> docs: drop building-*.md
> Merge 'seastar-addr2line: cleanups' from Kefu Chai
> seastar-addr2line: refactor tests using unittest
> seastar-addr2line: extract do_test() and main()
> seastar-addr2line: do not import unused modules
> scheduling: add a `rename` callback to scheduling_group_key_config
> reactor: syscall thread: wakeup up reactor with finer granularity
> build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS
> build: extract dpdk_extra_cflags out
> core/sstring: remove a temporary variable
> Merge 'treewide: include what we use, and add a checkheaders target' from Kefu Chai
> perftune.py: auto-select the same number of IRQ cores on each NUMA
> prometheus: remove unused headers
> core/sstring: define <=> operator for sstring
> Merge 'core: s/reserve_additional_memory/reserve_additional_memory_per_shard/' from Kefu Chai
> include: do not include <concepts> directly
> coding_style: note on self-contained header requirement
> circileci: build checkheaders in addition to default target
> build: add checkheaders target
> net/toeplitz: s/u_int/unsigned/
> net/tcp-stack: add forward declaration for seastar::socket
> core, net, util: include used headers
* main: set reserved memory for wasm on per-shard basis
this change is a follow-up of
f05d612da8 and
4a0134a097.
this change depends on the related change in Seastar to reserve
additional memory on a per-shard basis.
per Wojciech Mitros's comment:
> it should have probably been 50MB per shard
in other words, we always execute the same set of udfs on all
shards. and since one cannot predict the number of shards, but one
can have a rough estimate of the amount of memory a regular (set
of) udfs could use, a per-shard setting makes more sense.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.
`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.
Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.
Fixes: #13055
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13056
* throw marshal_exception if the whole string is not parsed; we
should error out if the parsed string contains garbage at the end.
before this change, we silently accepted a uuid like
"ce84997b-6ea2-4468-9f02-8a65abf4wxyz", and parsed it as
"ce84997b-6ea2-4468-9f02-8a65abf4". this is not correct.
* throw marshal_exception if stoull() throws.
`stoull()` throws if it fails to parse a string to an unsigned long
long; we should translate the exception to `marshal_exception`, so
we can handle these exceptions in a consistent manner.
test is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13069
Now that both callers of this method are non-static, it can be made
non-static too. While at it, make two more changes:
1. move the thing to private
2. remove the explicit cql3::query_processor::cache_internal::yes argument;
system_keyspace::execute_cql() applies it on its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager will need the system keyspace to get the truncation record from,
so add it explicitly. The start-stop sequence now allows that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The former needs the latter to get truncation records from and will thus
need it as an explicit dependency. In order to have that, batchlog needs to
start after the system keyspace. This works because starting the batchlog
manager doesn't do anything that's required by the system keyspace. This is
indirectly proven by cql-test-env, in which the batchlog manager starts
later than it does in main
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The replayer code needs the system keyspace to fetch truncation records
from, thus it needs this explicit dependency. By the time it runs, the
system keyspace is fully initialized already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
instead of passing '0' in the initializer list to do aggregate
initialization, just use zero initialization. simpler this way.
also, this helps to silence a `-Wmissing-braces` warning, like
```
/home/kefu/dev/scylladb/auth/passwords.cc:21:43: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
static thread_local crypt_data tlcrypt = {0, };
^
{}
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13060
because `seastar::to_sstring()` defaults to `fmt::format_to()`, any
type which is supported by `fmt::formatter` is also supported
by `seastar::to_sstring()`. and the behavior of the existing implementation
is exactly the same as the defaulted one.
so let's drop the specialization and let
`fmt::formatter<sstables::generation_type>` do its job.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13070
- build: cmake: find ANTLR3 before using it
- build: cmake: define FMT_DEPRECATED_OSTREAM
- build: cmake: add include directory for lua
- build: cmake: link redis against db
Closes#13071
* github.com:scylladb/scylladb:
build: cmake: add more tests
build: cmake: find and link against RapidJSON
build: cmake: link couple libraries as whole archive
build: cmake: find ANTLR3 before using it
build: cmake: define FMT_DEPRECATED_OSTREAM
build: cmake: add include directory for lua
build: cmake: link redis against db
in general, the more static analysis the merrier. these warnings were previously disabled to silence warnings from Clang and/or GCC, but since we've addressed all of them, let's reenable them to detect potential issues early.
Closes#13063
* github.com:scylladb/scylladb:
build: reenable disabled warnings
test: lib: do not return a local reference
dht: incremental_owned_ranges_checker: use lower_bound()
types: reimplement in terms of a variable template
query_id: extract into new header
test/cql-pytest: test for CLUSTERING ORDER BY verification in MV
test/cql-pytest: allow "run-cassandra" without building Scylla
build: reenable unused-{variable,lambda-capture} warnings
test: reader_concurrency_semaphore_test: define target_memory in debug mode
flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
make_nonforwardable: test through run_mutation_source_tests
make_nonforwardable: next_partition and fast_forward_to when single_partition is true
make_forwardable: fix next_partition
flat_mutation_reader_v2: drop forward_buffer_to
nonforwardable reader: fix indentation
nonforwardable reader: refactor, extract reset_partition
nonforwardable reader: add more tests
nonforwardable reader: no partition_end after fast_forward_to()
nonforwardable reader: no partition_end after next_partition()
nonforwardable reader: no partition_end for empty reader
api::failure_detector: mark set_phi_convict_threshold unimplemented
test: memtable_test: mark dummy variable for loop [[maybe_unused]]
idl-compiler: mark captured this used
raft: reference this explicitly
util/result_try: reference this explicitly
sstables/sstables: mark dummy variable for loop [[maybe_unused]]
treewide: do not define/capture unused variables
service: storage_service: clear _node_ops in batch
cql-pytest: add tests for sum() aggregate
build: cmake: extract mutation,db,replica,streaming out
build: cmake: link the whole auth
build: cmake: extract thrift out
build: cmake: expose scylla_gen_build_dir from "interface"
build: cmake: find libxcrypt before using it
build: cmake: find Thrift before using it
build: cmake: support thrift < 0.11.0
test/cql-pytest: move aggregation tests to one file
Revert "Revert "storage_service: Enable Repair Based Node Operations (RBNO) by default for all node ops""
storage_service: Wait for normal state handler to finish in replace
storage_service: Wait for normal state handler to finish in bootstrap
row_cache: pass partition_start though nonforwardable reader
doc: fix the version in the comment on removing the note
doc: specify the versions where Alternator TTL is no longer experimental
in general, the more static analysis the merrier. these warnings
were previously disabled to silence warnings from Clang and/or GCC,
but since we've addressed all of them, let's reenable them to
detect potential issues early.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the return type of `get_table_views()` is a reference, so we
cannot return a reference to a temporary value.
in this change, a member variable is added to hold the _table_schema,
so it can outlive the function call.
this should silence following warning from Clang:
```
test/lib/expr_test_utils.cc:543:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
return {view_ptr(_table_schema)};
^~~~~~~~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
even though RapidJSON is a header-only library, we still need to
find it and "link" against it to add the include directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
turns out we are using static variables to register entries in
global registries, and these variables are not directly referenced,
so the linker just drops them when linking the executables or shared
libraries. to address this problem, we just link the whole archive.
another option would be to create a linker script or to pass
--undefined=<symbol> to the linker; neither of them is straightforward.
a helper function is introduced to do this, as we cannot use CMake
3.24 as yet.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
if ANTLR3's header files are not installed into /usr/include or
other directories searched by the compiler by default, there is a chance
we cannot build the tree, so we have to find it first. as /opt/scylladb
is the directory where `scylla-antlr35-c++-dev` is installed on
debian derivatives, this directory is added so the find-package module
can find the header files.
```
In file included from /home/kefu/dev/scylla/db/legacy_schema_migrator.cc:38:
In file included from /home/kefu/dev/scylla/cql3/util.hh:21:
/home/kefu/dev/scylla/build/cmake/cql3/CqlParser.hpp:55:10: fatal error: 'antlr3.hpp' file not found
^~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
otherwise, we'd have
```
In file included from /home/kefu/dev/scylla/redis/keyspace_utils.cc:19:
In file included from /home/kefu/dev/scylla/db/query_context.hh:14:
In file included from /home/kefu/dev/scylla/cql3/query_processor.hh:24:
In file included from /home/kefu/dev/scylla/lang/wasm_instance_cache.hh:19:
/home/kefu/dev/scylla/lang/wasm.hh:14:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This commit makes the following changes to the docs landing page:
- Adds the ScyllaDB enterprise docs as one of three tiles.
- Modifies the three tiles to reflect the three flavors of ScyllaDB.
- Moves the "New to ScyllaDB? Start here!" under the page title.
- Renames "Our Products" to "Other Products" to list the products other
than ScyllaDB itself. In addtition, the boxes are enlarged from to
large-4 to look better.
The major purpose of this commit is to expose the ScyllaDB
documentation.
docs: fix the link
Closes#13065
Implementation of task_manager's task that covers major keyspace compaction
on one shard.
Closes#12662
* github.com:scylladb/scylladb:
test: extend major keyspace compaction tasks test
compaction: create task manager's task for major keyspace compaction on one shard
- build: cmake: extract more subsystem out into its own CMakeLists.txt
- build: cmake: remove swagger_gen_files
- build: cmake: remove stale TODO comments
- build: cmake: expose scylla_gen_build_dir
- build: cmake: link against cryptopp
- build: cmake: add missing source to utils
- build: cmake: move lib sources into test-lib
- build: cmake: add test/perf
Closes#13059
* github.com:scylladb/scylladb:
build: cmake: add expr_test test
build: cmake: allow test to specify the sources
build: cmake: add test/perf
build: cmake: move lib sources into test-lib
build: cmake: add missing source to utils
build: cmake: link against cryptopp
build: cmake: expose scylla_gen_build_dir
build: cmake: remove stale TODO comments
build: cmake: remove swagger_gen_files
build: cmake: extract more subsystem out into its own CMakeLists.txt
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use of o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.
Closes#12950
* github.com:scylladb/scylladb:
commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
commitlog: change type of stored size
This method requires callers to remember that the sstable is the collection of files on a filesystem and to know what exact directory they are all in. That's not going to work for object storage, instead, sstable should be moved between more abstract states.
This PR replaces move_to_new_dir() call with the change_state() one that accepts target sub-directory string and moves files around. Currently supported state changes:
* staging -> normal
* upload -> normal | staging
* any -> quarantine
All are pretty straightforward and move files between table basedir subdirectories, with the exception that upload -> quarantine moves into the upload/quarantine subdirectory. Another thing to keep in mind is that the normal state doesn't have its own subdir but maps directly to the table's base directory.
Closes#12648
* github.com:scylladb/scylladb:
sstable: Remove explicit quarantization call
test: Move move_to_new_dir() method from sstable class
sstable, dist.-loader: Introduce and use pick_up_from_upload() method
sstables, code: Introduce and use change_state() call
distributed_loader: Let make_sstables_available choose target directory
some tests are compiled from more than one source file, so add an
extra parameter that lets them customize the sources.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
due to a circular dependency, the targets in test/perf do not
compile: the .cc files under the root of the project reference
symbols defined by the source files under subdirectories, but those
source files also reference symbols defined by the .cc files under
the root. the general structure is created nonetheless.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we include cryptopp/ headers, we need to find it and link
against it explicitly, instead of relying on seastar to do this.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch fixes a problem affecting decommission and removenode
which may lead to data consistency problems when one of the nodes
unilaterally decides to abort the node operation without the
coordinator noticing.
If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later detect that
one of the nodes aborted during the sending of the decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.
It's unsafe to finalize the operation after another node aborted. The
other node reverted to the old topology, which it had been running
with for some time, without considering the pending replica when
handling requests. As a result, we may end up with consistency issues.
Writes made by those coordinators may not be replicated to CL replicas
in the new topology, and streaming may have missed those writes,
depending on timing.
It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.
There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch changes removenode
to also take this approach by moving sending of remove_done earlier.
We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.
The race can still happen, but the time window is now much smaller.
Fixes#12989
Refs #12969
instead of using a while loop for finding the lower_bound,
just use std::lower_bound() for finding if current node owns given
token. this has two advantages:
* better readability: as lower_bound is exactly what this loop
calculates.
* lower_bound uses binary search for searching the element,
this algorithm should be faster than linear under most
circumstances.
* lower_bound uses std::advance() and the prefix increment operator,
which should be more performant than the postfix increment operator,
as it does not create a temporary iterator instance.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13008
data_type_for() is a function template that converts a C++
type to a database dynamic type (data_type object).
Instead of implementing a function per type, implement a variable
template instance. This is shorter and nicer.
Since the original type variables (e.g. long_type) are defined separately,
use a reference instead of copying to avoid initialization order problems.
To catch misuses of data_type_for the general data_type_for_v variable
template maps to some unused tag type which will cause a build error
when instantiated.
The original motivation for this was to allow for partial
specialization of data_type_for() for tuple types, but this isn't
really workable since the native type for tuples is std::vector<data_value>,
not std::tuple, and I only checked this after getting the work done,
so this isn't helping anything; it's just a little nicer.
Closes#13043
This PR adds a note to the Alternator TTL section to specify in which Open Source and Enterprise versions the feature was promoted from experimental to non-experimental.
The challenge here is that OSS and Enterprise are (still) **documented together**, but they're **not in sync** in promoting the TTL feature: it's still experimental in 5.1 (released) but no longer experimental in 2022.2 (to be released soon).
We can take one of the following approaches:
a) Merge this PR with master and ask the 2022.2 users to refer to master.
b) Merge this PR with master and then backport to branch-5.1. If we choose this approach, it is necessary to backport https://github.com/scylladb/scylladb/pull/11997 beforehand to avoid conflicts.
I'd opt for a) because it makes more sense from the OSS perspective and helps us avoid mess and backporting.
Closes#12295
* github.com:scylladb/scylladb:
doc: fix the version in the comment on removing the note
doc: specify the versions where Alternator TTL is no longer experimental
This small series reorganizes the existing functional tests for aggregation (min, max, count) and adds additional tests for sum reproducing the strange (but Cassandra-compatible) behavior described in issue #13027.
Closes#13038
* github.com:scylladb/scylladb:
cql-pytest: add tests for sum() aggregate
test/cql-pytest: move aggregation tests to one file
query_id currently lives in query-request.hh, a busy place
with lots of dependencies. In turn it gets pulled by
uuid.idl.hh, which is also very central. This makes
test/raft/randomized_nemesis_test.cc which is nominally
only dependent on Raft rebuild on random header file changes.
Fix by extracting into a new header.
Closes#13042
The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()`
Fixes: #12249
Closes#12978
* github.com:scylladb/scylladb:
flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
make_nonforwardable: test through run_mutation_source_tests
make_nonforwardable: next_partition and fast_forward_to when single_partition is true
make_forwardable: fix next_partition
flat_mutation_reader_v2: drop forward_buffer_to
nonforwardable reader: fix indentation
nonforwardable reader: refactor, extract reset_partition
nonforwardable reader: add more tests
nonforwardable reader: no partition_end after fast_forward_to()
nonforwardable reader: no partition_end after next_partition()
nonforwardable reader: no partition_end for empty reader
row_cache: pass partition_start though nonforwardable reader
- treewide: do not define/capture unused variables
- sstables/sstables: mark dummy variable for loop [[maybe_unused]]
- util/result_try: reference this explicitly
- raft: reference this explicitly
- idl-compiler: mark captured this used
- build: reenable unused-{variable,lambda-capture} warnings
Closes#12915
* github.com:scylladb/scylladb:
build: reenable unused-{variable,lambda-capture} warnings
test: reader_concurrency_semaphore_test: define target_memory in debug mode
api::failure_detector: mark set_phi_convict_threshold unimplemented
test: memtable_test: mark dummy variable for loop [[maybe_unused]]
idl-compiler: mark captured this used
raft: reference this explicitly
util/result_try: reference this explicitly
sstables/sstables: mark dummy variable for loop [[maybe_unused]]
treewide: do not define/capture unused variables
service: storage_service: clear _node_ops in batch
Since commit 73e258fc34, Scylla has partial
verification for the CLUSTERING ORDER BY clause in CREATE MATERIALIZED
VIEW. Specifically, invalid column names are rejected. But for reasons
explained in issue #12936 and in the test in this patch, Cassandra
demands that if CLUSTERING ORDER BY appears it must list all the
clustering columns, with no duplicates, and do so in the right order.
This patch replaces an existing test which suggested it is fine
(an extension over Cassandra) to accept a partial list of clustering
columns with a test verifying that such a partial list, an
incorrectly-ordered list, or a list with duplicates is rejected.
The new test fails on Scylla and passes on Cassandra, so it is marked
as xfail.
Refs #12936.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12938
This PR fixes the seastar::rpc::closed_error error in the test_topology suite and enables RBNO by default.
Closes#12970
* github.com:scylladb/scylladb:
Revert "Revert "storage_service: Enable Repair Based Node Operations (RBNO) by default for all node ops""
storage_service: Wait for normal state handler to finish in replace
storage_service: Wait for normal state handler to finish in bootstrap
Before this patch, all scripts which use test/cql-pytest/run.py
looked for the Scylla executable as their first step. This is usually
the right thing to do, except in two cases where Scylla is *not* needed:
1. The script test/cql-pytest/run-cassandra.
2. The script test/alternator/run with the "--aws" option.
So in this patch we change run.py to only look for Scylla when actually
needed (the find_scylla() function is called). In both cases mentioned
above, find_scylla() will never get called and the script can work even
if Scylla was never built.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13010
- build: cmake: support thrift < 0.11.0
- build: cmake: find Thrift before using it
- build: cmake: find libxcrypt before using it
- build: cmake: expose scylla_gen_build_dir from "interface"
- build: cmake: extract thrift out
- build: cmake: link the whole auth
- build: cmake: extract mutation,db,replica,streaming out
Closes#12990
* github.com:scylladb/scylladb:
build: cmake: extract mutation,db,replica,streaming out
build: cmake: link the whole auth
build: cmake: extract thrift out
build: cmake: expose scylla_gen_build_dir from "interface"
build: cmake: find libxcrypt before using it
build: cmake: find Thrift before using it
build: cmake: support thrift < 0.11.0
now that all -Wunused-{variable,lambda-capture} warnings are taken
care of. let's reenable these warnings so they can help us to identify
potential issues.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as no class derives from `cf_name`, it is viable
to mark it `final`.
this change is made to silence a warning from Clang,
like:
```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_LOCALE -DFMT_SHARED -DHAVE_LZ4_COMPRESS_DEFAULT -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=6 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/build/cmake -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -Wall -Werror -Wno-mismatched-tags -Wno-missing-braces -Wno-c++11-narrowing -O0 -g -gz -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT CMakeFiles/scylla.dir/data_dictionary/data_dictionary.cc.o -MF CMakeFiles/scylla.dir/data_dictionary/data_dictionary.cc.o.d -o CMakeFiles/scylla.dir/data_dictionary/data_dictionary.cc.o -c /home/kefu/dev/scylladb/data_dictionary/data_dictionary.cc
In file included from /home/kefu/dev/scylladb/data_dictionary/data_dictionary.cc:9:
In file included from /home/kefu/dev/scylladb/data_dictionary/data_dictionary.hh:11:
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/optional:287:2: error: destructor called on non-final 'cql3::cf_name' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
_M_payload._M_value.~_Stored_type();
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/optional:318:4: note: in instantiation of member function 'std::_Optional_payload_base<cql3::cf_name>::_M_destroy' requested here
_M_destroy();
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/optional:439:57: note: in instantiation of member function 'std::_Optional_payload_base<cql3::cf_name>::_M_reset' requested here
_GLIBCXX20_CONSTEXPR ~_Optional_payload() { this->_M_reset(); }
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/optional:514:17: note: in instantiation of member function 'std::_Optional_payload<cql3::cf_name>::~_Optional_payload' requested here
constexpr _Optional_base() = default;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/optional:739:17: note: in defaulted default constructor for 'std::_Optional_base<cql3::cf_name>' first required here
constexpr optional(nullopt_t) noexcept { }
^
/home/kefu/dev/scylladb/cql3/statements/raw/batch_statement.hh:37:28: note: in instantiation of member function 'std::optional<cql3::cf_name>::optional' requested here
: cf_statement(std::nullopt)
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/optional:287:23: note: qualify call to silence this warning
_M_payload._M_value.~_Stored_type();
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13039
This flag designates that we should consume only one
partition from the underlying reader. This means that
attempts to move to another partition should cause an EOS.
When next_partition is called, the buffer could
contain partition_start and possibly static_row.
In this case clear_buffer_to_next_partition will
not remove anything from the buffer and the
reader position should not change. Before this patch,
however, we used to set _end_of_stream=false,
which violated the forwardable-reader
contract - the data of the next partition
was emitted after the data of the first partition
without intermediate EOS.
This bug was found when debugging
test_make_nonforwardable_from_mutations_as_mutation_source flakiness.
A corresponding focused test_make_forwardable_next_partition
has been added to exercise this problem.
This patch fixes a problem with the fast_forward_to method
which is similar to the one with next_partition: no
partition_end should be injected for the partition if
fast_forward_to was called inside it.
Before the patch, nonforwardable reader injected
partition_end unconditionally. This caused problems
when next_partition() was called: the downstream
reader might have already injected its own
partition_end marker, and the one from the nonforwardable
reader was a duplicate.
Fixes: #12249
The patch introduces the _partition_is_open flag
and injects partition_end only if there was some data
in the input reader.
A simple unit test has been added for
the nonforwardable reader which checks this
new behaviour.
The WASM UDF implementation has changed since the last time the docs
were written. In particular, the Rust helper library has been
released, and using it should be the recommended method.
Some decisions that were only experimental at the start were also
"set in stone", so we should refer to them as such.
The docs also contain some code examples. This patch adds tests for
these examples to make sure that they are not wrong and misleading.
Closes#12941
small_vector should be feature-wise compatible with std::vector<>,
let's add operator<=> for it.
also, there is no need to define operator!=() explicitly; C++20
defines it for us if operator==() is defined, so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13032
Currently the ranges_to_stream variable lives
on the caller state, and do_streaming() moves its
contents down to request_ranges/transfer_ranges
and then calls clear() to make it ready for reuse.
This works in principle, but it makes it harder
for an occasional reader of this code to figure out
what is going on.
This change transfers control of the ranges_to_stream vector
to do_streaming by calling it with std::exchange(ranges_to_stream, {}).
With that, the moved vector doesn't need to be cleared by
do_streaming, and the caller is responsible for readying
the variable for reuse in its for loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than copying the ranges vector.
Note that add_transfer_ranges itself cannot simply move the ranges
since it copies them for multiple tables.
While at it, move also the keyspace and column_family strings.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The ranges can be moved rather than copied to both
`request_ranges` and `transfer_ranges` as they are only cleared
after this point.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After calling get_range_fetch_map, ranges_for_keyspace
is not used anymore.
Synchronously destroying it may potentially stall in large clusters
so use utils::clear_gently to gently clear the map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use const& to refer to the input ranges and endpoints
rather than copying them individually along the way
more than needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling nr_ranges_to_stream() inside `do_streaming`.
As nr_ranges_to_stream depends on `_to_stream`, which will only be
updated later, after the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
range_vec is used for calculating nr_ranges_to_stream.
Currently, the ranges_to_stream that were
moved out of range_vec are pushed back on exception,
but this isn't safe, since they may have already been moved
to request_ranges or transfer_ranges.
Instead, erase the ranges we pass to do_streaming
only after it succeeds so on exception, range_vec
will not need adjusting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
without C++23's `std::ranges::repeat_view`, it'd be cumbersome to
implement a loop without a dummy variable. this change helps
silence the following warning:
```
test/boost/memtable_test.cc:1135:26: error: unused variable 'value' [-Werror,-Wunused-variable]
for (int value : boost::irange<int>(0, num_flushes)) {
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
sometimes the captured `this` is used in the generated C++ code,
and sometimes it is not. to reenable the `-Wunused-lambda-capture`
warning, let's mark the captured `this` as used.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Clang complains that the captured `this` is not used, like
```
/home/kefu/dev/scylladb/raft/fsm.hh:644:21: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
auto visitor = [this, from, msg = std::move(msg)](const auto& state) mutable {
^
/home/kefu/dev/scylladb/raft/server.cc:738:11: note: in instantiation of function template specialization 'raft::fsm::step<raft::append_request>' requested here
_fsm->step(from, std::move(append_request));
^
```
but `step(..)` is a non-static member function of `fsm`, so `this`
is actually used. to silence Clang's warning, let's just reference it
explicitly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
quote from Avi's comment
> It's supposed to be illegal to call handle(...) without this->,
> because handle() is a dependent name (but many compilers don't
> insist, gcc is stricter here). So two error messages competed,
> and "unused this capture" won.
without this change, Clang complains under `-Wunused-lambda-capture`
that `this` is not used. in this change, `this` is explicitly
referenced to silence the warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, _node_ops are cleared one after another in
`storage_service::node_ops_abort()` when `ops_uuid` is not specified.
but this
* is not efficient
* is not quite readable
* introduces an unused variable
so, in this change, we just clear it in batch. this should silence
a `-Wunused-variable` warning from Clang.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds regression tests for the strange (but Cassandra-compatible)
behavior described in issue #13027 - that sum of no results returns 0
(not null or nothing), and if also asking for p, we get a null there too.
Refs #13027.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
- main: expose tools as a vector<>
- main: use a struct for representing tool
- main: track tools description in tool struct
- main: add missing descriptions for tools
- main: move get_tools() into main()
Fixes#13026
Closes#13030
* github.com:scylladb/scylladb:
main: move get_tools() into main()
main: add missing descriptions for tools
main: track tools description in tool struct
main: use a struct for representing tool
main: expose tools as a vector<>
without this change, the linker would remove the .o files that are not
referenced by other translation units. but we do use static variables
to, for instance, register classes in a global registry.
so, let's force the linker to include the whole archive.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
also, move "interface" linkage from scylla to "thrift", because
it is "thrift" that uses "interface".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it builds headers like "gen/Cassandra.h", and the target
uses "interface" via these headers, so "interface" is obliged
to expose this include directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should find the libxcrypt library before using it. in this change,
Findlibxcrypt.cmake is added to find it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should find the Thrift library before using it. in this change,
FindThrift.cmake is added to find it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We had separate test files test_minmax.py and test_count.py, but the
separation was artificial (and test_count.py even had one test using
min()). Now that I want to add another test for sum(), I don't know
where to put it. So in this patch I combine test_minmax.py and
test_count.py into one test file - test_aggregate.py, and we can
later add sum() tests in the same file.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
to silence the warning from rpmbuild, like
```
RPM build warnings:
line 202: It's not recommended to have unversioned Obsoletes: Obsoletes: tuned
```
being more specific this way. quoting from the commit message of
303865d979 for the version number:
> tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
> and other sysctl tunables that we so laboriously tuned, dropping
> performance by a factor of 5 (due to increased latency). Fix by
> obsoleting tuned during install (in effect, we are a better tuned,
> at least for us).
with this change, it'd be easier to identify potential issues when
building / packaging.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12721
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job. also, since
operator<=> is defaulted, there is no need to define operator==
anymore, so drop it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job. also, since
operator<=> is defaulted, there is no need to define operator==
anymore, so drop it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job. also, since
operator<=> is defaulted, there is no need to define operator==
anymore, so drop it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
there is no need to have a dedicated function which is only consumed
by `main()`. so let's move the body of `get_tools()` into `main`. and
with this change, a plain C array would suffice. so just use a plain
array for tools.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we can encapsulate the description of a certain tool in this
struct, with more readable field names than a tuple<> would offer,
if we want to track all tools in this vector.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so, in addition to looking up a tool by the name in it, we will be
able to list all tools in this vector. this change paves the road to
a more general solution to handle `--list-tools`.
in this change
* `lookup_main_func()` is replaced by `get_tools()`.
* instead of checking `main_func` outside the if block,
check it inside the `if` block: we already know there whether we
have a matching tool, and we can return early right there.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
test_broadcast_kv_store does not use await or yield at all, so
there is no need to mark it with "asyncio" mark.
tested using
```
SCYLLA_HOME=$HOME/scylla build/cmake/scylla --overprovisioned --developer-mode=yes --consistent-cluster-management=true --experimental-features=broadcast-tables
...
pytest broadcast_tables/test_broadcast_tables.py
```
the test still passes.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13006
In storage_service::handle_state_normal, storage_service::notify_joined
will be called, which drops the rpc connections to the node that becomes
normal. This causes rpc calls to that node to fail with a
seastar::rpc::closed_error.
Consider this:
- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
bootstrap
- n2 fails to bootstrap
For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.
repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})
This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.
Fixes#12764
Fixes#12956
now that Seastar can be built as shared libraries, we can use it for
faster development iteration with less disk usage.
in this change
* configure.py:
- 'build_seastar_shared_libs' is added as yet another mode value,
so each mode has its own setting. 'debug' and 'dev' have
this enabled, while other modes disable it.
- link scylla with rpath specified, so it can find `libseastar.so`
in build directory.
* install.sh: remove the rpath as the rpath in the elf image will
not be available after the relocatable package is installed, also
rpmbuild will error out when it uses check-rpaths to verify
the elf images (executables and shared libraries), as the rpaths
encoded in them are not known ones. patchelf() will take care of
the shared libraries linked by the executables. so we don't need
to worry about libseastar.so or libseastar_testing.so.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12801
Recently, we overhauled the error handling of UNSET_VALUE in various
places where it is not allowed. This patch adds two more regression
tests for this error handling. Both tests pass on Scylla today and
on Cassandra, but fail on earlier Scylla (e.g., I tested 5.1.5):
The first test does an INSERT with an UNSET_VALUE for the clustering key.
An UNSET_VALUE is designed to skip part of the write - not an entire
write - so this attempt should fail - not silently be skipped.
The write indeed fails with an error on Cassandra, and on recent
Scylla, but silently did nothing in older Scylla which leads this
test to fail there.
The second test does the same thing with LWT (an "IF NOT EXISTS"
added to the insert). Scylla's failure here was even more spectacular -
it crashed (as reported in issue #13001) instead of silently skipping
the write. The test passes on Scylla today and on Cassandra, which
both report the failure cleanly.
Refs #13001.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13007
Now the nonforwardable reader unconditionally produces
a partition_end, even if the input reader was empty.
This is strange in itself, but it also gets in the way of
properly fixing its next_partition() method, which is
our ultimate goal. So we are going to change this
and produce partition_end only if there was some
data in the stream. However, this creates a problem:
now we pop partition_start from the underlying reader
in autoupdating_underlying_reader::move_to_next_partition
and manually push it back to downstream readers
bypassing nonforwardable reader. This means if we
change the logic in nonforwardable reader as described
we will end up with partition_start without partition_end
in the downstream readers.
This patch rectifies this by making sure that
nonforwardable will see the initial partition_start.
We inject this partition_start just before the
nonforwardable reader, into delegating_reader.
This also makes the result type of
range_populating_reader::operator() a bit simpler,
we don't need to pass partition_start anymore.
Cassandra is very strict in the CLUSTERING ORDER BY clause which it
allows when creating a materialized view - if it appears, it must
list all the clustering columns of the view. Scylla is less strict -
a subset of the clustering columns may be specified. But Scylla was
*too* lenient - a user could specify non-clustering columns and even
non-existent columns and Scylla would not fail the MV creation.
This patch fixes that - with it, MV creation fails if anything besides
clustering columns is listed in CLUSTERING ORDER BY.
An xfailing test we had for this case no longer fails after this
patch so its xfail mark is removed. We also add a few more corner
cases to the tests.
This patch also fixes one C++ test which had exactly the error that this
patch detects - the test author tried to use the partition key, instead
of the clustering key, in CLUSTERING ORDER BY (this error had no effect
because the specified order, "asc", was the default anyway).
Fixes#10767
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12885
`load_one_schema()` and `load_schemas_from_file()` are dropped,
as they are neither used by `scylla-sstable` nor tested by
`schema_loader_test.cc`. The latter tests `load_schemas()`, which
is much the same as `load_one_schema_from_file()`, but more
permissive in the sense that it allows zero schemas or more than
one schema in the specified path.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13003
Said method can now throw `std::bad_alloc` since aab5954. All call sites should have been adapted in the series introducing the throw, but some managed to slip through because the OOM unit test didn't run in debug mode. This series fixes the remaining unpatched call sites and makes sure the test runs in debug mode too, so leaks like this are detected.
Fixes: #12767
Closes#12756
* github.com:scylladb/scylladb:
test/boost/reader_concurrency_semaphore_test: run oom protection tests in debug mode
treewide: adapt to throwing reader_concurrency_semaphore::consume()
Closes#12071
* github.com:scylladb/scylladb:
docs/dev: building.md: mention node-exporter packages
docs/dev: building.md: replace `dev` with `<mode>` in list of debs
`effective_replication_map` is not a base class of any other class, so
there is no need to mark any of its member functions as `virtual`. this
change should address the following warning from Clang:
```
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:205:9: error: delete called on non-final 'locator::effective_replication_map' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete value_ptr;
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:202:9: note: in instantiation of member function 'seastar::internal::lw_shared_ptr_accessors_esft<locator::effective_replication_map>::dispose' requested here
dispose(static_cast<T*>(counter));
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:317:27: note: in instantiation of member function 'seastar::internal::lw_shared_ptr_accessors_esft<locator::effective_replication_map>::dispose' requested here
accessors<T>::dispose(_p);
^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:263:12: note: in instantiation of member function 'seastar::lw_shared_ptr<locator::effective_replication_map>::~lw_shared_ptr' requested here
return make_lw_shared<effective_replication_map>(std::move(rs), std::move(tmptr), std::move(replication_map), replication_factor);
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12992
since "token()" computes the token for a given partition key,
if we pass a key of the wrong type, it should be rejected.
in this change, we
* validate the keys before returning the "token()" function.
* drop the "xfail" decorator from two of the tests. they now pass
after this fix.
* change the tests which previously passed the wrong number of
null-containing arguments to "token()" and expected it to return
null; they now verify that "token()" rejects these
arguments with the expected error message.
Fixes#10448
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12991
turns out what we need is a fmt::formatter<sstables::generation_type>,
not operator<<(ostream&, sstables::generation_type), as its only use
case is the formatter used by seastar::format().
specializing fmt::formatter<sstables::generation_type>
* brings us one step closer to dropping `FMT_DEPRECATED_OSTREAM`
* allows us to customize the way generation_type is printed by
customizing the format specifier.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12983
as, in C++20, the compiler is able to generate operator==() for us,
and the default generated one is identical to what we have now.
also, in C++20, operator!=() is generated by the compiler if operator==()
is defined, so we can dispense with our hand-written one.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of the family of comparison operators, just define <=>; as of
C++20, the compiler will define all six comparison operators for us.
in this change, operator<=> is defined, so we get more compact
code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Node ops has the following procedure:
1 for node in sync_nodes
send prepare cmd to node
2 for node in sync_nodes
send heartbeat cmd to node
If any of the prepare cmds in step 1 takes longer than the heartbeat
watchdog timeout, the heartbeat in step 2 will be too late to update the
watchdog, and as a result the watchdog will abort the operation.
To prevent a slow prepare cmd from killing the node operation, we start the
heartbeat earlier in the procedure.
Fixes#11011
Fixes#12969
Closes#12980
An aggregation query on a counter column fails because forward_service looks for a function with counter as an argument, and no such function exists. The long type should be used instead.
Fixes: #12939
Closes#12963
* github.com:scylladb/scylladb:
test:boost: counter column parallelized aggregation test
service:forward_service: use long type when column is counter
It's known that reading large cells in reverse causes large allocations.
Source: https://github.com/scylladb/scylladb/issues/11642
The loading is preliminary work for splitting large partitions into
fragments composing a run, to later be able to read such a run
efficiently using the position metadata.
The splitting is not turned on yet, anywhere. Therefore, we can
temporarily disable the loading, as a way to avoid regressions in
stable versions. Large allocations can cause stalls due to foreground
memory eviction kicking in.
The default values for position metadata say that the first and last
positions include all clustering rows. They aren't used anywhere
other than by sstable_run to determine if a run is disjoint at
the clustering level, and given that no splitting is done yet, this
does not really matter.
Unit tests relying on position metadata were adjusted to enable
the loading, such that they can still pass.
Fixes#11642.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12979
Shard id is logged twice in repair (once explicitly, once added by logger).
Redundant occurrence is deleted.
shard_repair_task_impl::id (which contains global repair shard)
is renamed to avoid further confusion.
Fixes: #12955
Closes#12959
* github.com:scylladb/scylladb:
repair: rename shard_repair_task_impl::id
repair: delete redundant shard id from logs
Task manager task implementation covering the major keyspace
compaction that can be started through the /storage_service/keyspace_compaction/
api.
Closes#12661
* github.com:scylladb/scylladb:
test: add test for major keyspace compaction tasks
compaction: create task manager's task for major keyspace compaction
compaction: copy run_on_existing_tables to task_manager_module.cc
compaction: add major_compaction_task_impl
compaction: add pure virtual compaction_task_impl
compaction: add compaction module getter to compaction manager
`LZ4_compress_default()` was introduced in liblz4 v1.7.3, even though
the release note (https://github.com/lz4/lz4/releases/tag/v1.7.3)
of v1.7.3 didn't mention it. if we check the commit which added
this API, we can find all the releases that include it:
```
$ git tag --contains 1b17bf2ab8cf66dd2b740eca376e2d46f7ad7041
lz4-r130
r129
r130
r131
rc129v0
v1.7.3
v1.7.4
v1.7.4.2
v1.7.5
v1.8.0
v1.8.1
v1.8.1.2
v1.8.2
v1.8.3
v1.9.0
v1.9.1
v1.9.2
v1.9.3
v1.9.4
```
and v1.7.3 was released on Nov 17, 2016. some popular distro
releases also package a new enough liblz4:
- fedora 35 ships lz4-devel 1.9.3
- CentOS 7 ships lz4-devel 1.8.3
- debian 10 ships liblz4-dev 1.8.3
- ubuntu 18.04 ships liblz4-dev r131
so, in this change, we drop support for liblz4 < 1.7.3 for better
code readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12971
It was used in sstables streaming code up until e5be3352 (database,
streaming, messaging: drop streaming memtables) or nearby, then the
whole feature was reworked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12967
Some test cases can be made a bit more compact by using the aforementioned sugar.
Closes#12965
* github.com:scylladb/scylladb:
test: Make use of reusable_sst default format
tests: Use reusable_sst() where applicable
The sstable_test_env::reusable_sst() has default value for the format
argument. Patch the test cases that don't use one while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reusable_sst() is intended to be used to load the pre-existing
sstables from the test/resources directory and .load() them. Some test
cases, however, still do it "by hand".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
and move `redis/protocol_parser.rl` related rules into `redis`, as
it is a file used for the implementation of redis.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it already includes the necessary bits used by test-perf, so let's
just link the latter to the former.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
BUILD_TESTING is an option exposed by the CTest module, so let's
include the CTest module, and check if BUILD_TESTING is enabled
before including the Boost-based tests.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
All major compaction tasks will share some methods like
type or abort. The common part of the tasks should be
inherited from major_compaction_task_impl.
The initial intent was to reduce the fanout of shared_sstable.hh through
v.u.g.hh -> cql_test_env.hh chain, but it also resulted in some shots
around v.u.g.hh -> database.hh inclusion.
By and large:
- v.u.g.hh doesn't need database.hh
- cql_test_env.hh doesn't need v.u.g.hh (and thus -- the
shared_sstable.hh) but needs database.hh instead
- few other .cc files need v.u.g.hh directly as they pulled it via
cql_test_env.hh before
- add forward declarations in few other places
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12952
The sstable_directory now combines two activities:
* scans the list of files in /var/lib/data and generates sstable objects from it
* maintains the found sstables throughout the necessary processing (populate/reshard/reshape)
The former part is in fact storage-specific. If sstables are on a filesystem, then it should be scanned with listdir; there can be dangling files, like temp-TOCs, pending deletion logs and components not belonging to any TOCs. If sstables are on some other storage, then this part should work some other way.
That said, the sstable_directory is to be split into two pieces -- lister and "processing state". The latter would (may?) require renaming sstable_directory into something more relevant, but that's a huge and intrusive change. For now, just collect the lister stuff in one place.
Closes#12843
* github.com:scylladb/scylladb:
sstable_directory: Keep lister internals private
sstable_directory: Move most of .commit_directory_changes() on lister
sstable_directory: Remove temporary aliases
sstable_directory: Move most of .process_sstable_dir() on lister
sstable_directory: Move .handle_component() to components_lister
sstable_directory: Keep files_for_removal on scan_state
sstable_directory: Keep components_lister aboard
sstable_directory: Keep scan_state on components_lister
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use_o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken
comparisons in delete_segments.
Refs #11710
Allows reusing the regex for segment matching (for opening left-over segments after a crash).
Should remove any stalls caused by commitlog replay preparation.
v2: Add unit test for descriptor parsing
Closes#12112
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.
Fixes: #12821
Fixes: #12823
Fixes: #12708
Closes#12824
Currently, applying a schema change on a replica works like this:
1. Collect all affected keyspaces from incoming mutations
2. Read current state of schema
3. Apply the mutations
4. Read new state of schema
The "Read ... state of schema" step reads all kinds of schema
objects. In particular, to read the "table" objects, it does the
following:
for every affected keyspace k:
read all mutations from system_schema.tables for k
extract all existing table names from those mutations
for every existing table:
read mutations from {tables, columns, indexes, view_virtual_columns, ...} for that table
As you can see, the number of reads performed is O(nr tables in a
keyspace), not O(nr tables in a change). This means that making a
sequence of schema changes, like adding a table, is quadratic.
Another aspect which magnifies this is that we don't read those tables
using a single scan, but issue individual queries for each table
separately.
This patch optimizes this by considering only affected tables when
reading schema for the purpose of diff calculation.
When mutations contain multi-table deletions, we still read the
set of tables, like before. This could be optimized by looking
at the database to get the list, but it's not part of the patch.
I tested this using a test case provided by Kamil (kbr-scylla@53fe154)
./test.py --mode debug test_many_schema_changes -s
The test bootstraps a cluster and then creates about 40 schema
changes. Then a new node is bootstrapped and replays those changes via
group0.
In debug mode, each change takes roughly 2s to process before the
patch, and 0.5s after the patch.
The whole replay is reduced to 56% of what was before:
Before (1m19s) :
INFO 2023-01-20 19:44:35,848 [shard 0] raft_group0 - setup_group0: ensuring that the cluster has fully upgraded to use Raft...
INFO 2023-01-20 19:45:54,844 [shard 0] raft_group0 - setup_group0: waiting for peers to synchronize state...
After (45s):
INFO 2023-01-20 22:02:51,869 [shard 0] raft_group0 - setup_group0: ensuring that the cluster has fully upgraded to use Raft...
INFO 2023-01-20 22:03:36,834 [shard 0] raft_group0 - setup_group0: waiting for peers to synchronize state...
Closes#12592
There's a bunch of test cases that check how moving sstables files
around the filesystem works. These need the generic move_to_new_dir()
method from sstable, so move it there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When "uploading" an sstable scylla uses a short-cut -- the sstable's
files are to be put into upload/ subdir by the caller, then scylla just
pulls them in in the cheapest way possible -- by relinking the files.
When this happens sstable also changes its generation, which is the only
place where this happens at all. For object storage uploading is not
going to be _that_ simple, so for now add an fs-specific method to pick
up an sstable from upload dir with the intent to generalize it (if
possible) when object-storage uploading appears.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The call moves the sstable to the specified state.
The state change is translated into the storage driver's state change,
which for today's filesystem storage means moving between directories.
The "normal" state maps to the base dir of the table; there's no
dedicated subdir for this state, and this brings some trouble into the
play.
The thing is that in order to check if an sstable is already in the
"normal" state, it's impossible to compare its path to any
pre-defined values, as tables' base dirs are dynamic. To overcome this,
the change-state call checks whether the sstable is in one of the "known"
sub-states, and assumes that it's in the normal state otherwise.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sstables are loaded from upload/ subdir, the final step is to move
them from this directory into base or staging one. The uploading code
evaluates the target directory, then pushes it down the stack towards
make_sstables_available() method.
This patch replaces the path argument with a bool to_staging one. The
goal is to remove the knowledge of the exact sstable location (nowadays --
its files' path) from the distributed loader and keep it in the sstable
object itself. Next patches will make full use of this change.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- build: cmake: build release.cc as a library
- build: cmake: link alternator against cql3
- build: cmake: link scylla against xxHash::xxhash
- build: cmake: use lld or gold as linker if available
Closes#12942
* github.com:scylladb/scylladb:
build: cmake: use lld or gold as linker if available
build: cmake: link scylla against xxHash::xxhash
build: cmake: link alternator against cql3
build: cmake: build release.cc as a library
Was left unnoticed in 7c7eb81a ('Encapsulate filesystem access by
sstable into filesystem_storage subclass')
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12946
Now the lister provides a two-call API to the user -- process and commit.
The rest can and should be marked as private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Committing any changes made while scanning the storage is
storage-specific. Just like .process() was moved on lister, the
.commit() now does the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patches created a bunch of local aliases-references in
components_lister::process(). This patch just removes those aliases, no
functional changes are made here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Processing storage with sstable files/objects is storage-specific. The
components_lister is the right component to handle it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is in charge of collecting a found file on scan_state; it
logically belongs to the components_lister and its internals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This list is the list of on-disk files, which is the property of
filesystem scan state. When committing directory changes (read: removing
those files) the list can be moved-from the state.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lister is supposed to be alive throughout .process_sstable_dir() and
can die after .commit_directory_changes().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The scan_state keeps the state of listing directory with sstables. It
now lives on the .process_sstable_dir() stack, but it can as well live
on the lister itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/12945
This PR removes the incorrect information and updates the link to the relevant page in the Manager docs.
Closes#12947
* github.com:scylladb/scylladb:
doc: update the link to the Restore page in the ScyllaDB Manager documentation
doc: remove the wrong info about IPs from the note on the Restore page
The series fixes a race in case of a leader change while
add_entry_on_leader is sleeping and an abort during raft shutdown.
* '12863-fix-v1' of github.com:scylladb/scylla-dev:
raft: abort applier fiber when a state machine aborts
raft: fix race in add_entry_on_leader that may cause incorrect log length accounting
The test relies on an exact request size, which doesn't work if compression is applied. The driver enables compression only if both the server and the client agree on the codec to use. If a compression package
(e.g. lz4) is not installed, compression
is not used.
The trick with locally_supported_compressions is needed since I couldn't find any standard means to disable compression other than the compression flag
on the cluster object, which seemed too broad.
fixes: #12836
Closes#12854
* github.com:scylladb/scylladb:
test_shed_too_large_request: clarify the comments
test_shed_too_large_request: use smaller test string
test_shed_too_large_request fix: disable compression
instead of adding `XXH_PRIVATE_API` to compile definitions, link
scylla against xxHash::xxhash, which provides this definition for us.
also move the comment on `XXH_PRIVATE_API` into `FindxxHash.cmake`,
where this definition is added.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
otherwise we'd have
```
In file included from /home/kefu/dev/scylladb/alternator/executor.cc:37:
/home/kefu/dev/scylladb/cql3/util.hh:21:10: fatal error: 'cql3/CqlParser.hpp' file not found
^~~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we can attach compile definitions in a simpler way.
this change is based on Botond Dénes's change which gives an overhaul
to the existing CMake building system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- build: cmake: link cql3 against wasmtime_bindings
- build: cmake: output rust binding headers in expected dir
- build: cmake: link auth against cql3
Closes#12927
* github.com:scylladb/scylladb:
build: cmake: link auth against cql3
build: cmake: output rust binding headers in expected dir
build: cmake: link cql3 against wasmtime_bindings
In test_v2_apply_monotonically_is_monotonic_on_alloc_failures we
generate mutations with non-full continuity, so we should pass
is_evictable::yes to apply_monotonically(). Otherwise, it will assume
fully-continuous versions and not try to maintain continuity by
inserting sentinels.
This manifested in sporadic failures on continuity check.
Fixes#12882Closes#12921
* github.com:scylladb/scylladb:
test: mutation_test: Fix sporadic failure due to continuity mismatch
test: mutation_test: Fix copy-paste mistake in trace-level logging
this change was previously reverted by
cbc005c6f5. it turns out this change
was not the offending change, so let's resurrect it.
`job` was introduced back in 782ebcece4,
so we could consume the option specified in the DEB_BUILD_OPTIONS
environment variable. but now that we always repackage
the artifacts prebuilt in the relocatable package, we don't build
them anymore when packaging debian packages; see
9388f3d626. and `job` is not
passed to `ninja` anymore.
so, in this change, `job` is removed from debian/rules as well, as
it is not used.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12924
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.
This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.
All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla,
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:
Already known issues:
Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k
New issues:
Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12816
load-and-stream implements no policy when deciding which SSTables will go in
each streaming round (batch of 16 SSTables), meaning the choice is random.
It can take advantage of the fact that the LSM-tree layout, with ICS and LCS,
is a set of SSTable runs, where each run is composed of SSTables that are
disjoint in their key range.
By sorting SSTables to be streamed by their first key, the effect is that
SSTable runs will be incrementally streamed (in token order).
SSTable runs in the same replica group (or in the same node) will have their
content deduplicated, reducing significantly the amount of data we need to
put on the wire. The improvement is proportional to the space amplification
in the table, which again, depends on the compaction strategy used.
Another important benefit is that the destination nodes will receive SSTables
in token order, allowing off-strategy compaction to be more efficient.
This is how I tested it:
1) Generated a 5GB dataset to a ICS table.
2) Started a fresh 2-node cluster. RF=2.
3) Ran load-and-stream against one of the replicas.
BEFORE:
$ time curl -X POST "http://127.0.0.1:10000/storage_service/sstables/keyspace1?cf=standard1&load_and_stream=true"
real 4m40.613s
user 0m0.005s
sys 0m0.007s
AFTER:
$ time curl -X POST "http://127.0.0.1:10000/storage_service/sstables/keyspace1?cf=standard1&load_and_stream=true"
real 2m39.271s
user 0m0.005s
sys 0m0.004s
That's ~1.76x faster.
That's explained by deduplication:
BEFORE:
INFO 2023-02-17 22:59:01,100 [shard 0] stream_session - [Stream #79d3ce7a-ea47-4b6e-9214-930610a18ccd] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3445376, received_partitions=2755835
INFO 2023-02-17 22:59:41,491 [shard 0] stream_session - [Stream #bc6bad99-4438-4e1e-92db-b2cb394039c8] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3308288, received_partitions=2836491
INFO 2023-02-17 23:00:20,585 [shard 0] stream_session - [Stream #e95c4f49-0a2f-47ea-b41f-d900dd87ead5] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3129088, received_partitions=2734029
INFO 2023-02-17 23:00:49,297 [shard 0] stream_session - [Stream #255cba95-a099-4fec-a72c-f87d5cac2b1d] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2544128, received_partitions=1959370
INFO 2023-02-17 23:01:33,110 [shard 0] stream_session - [Stream #96b5737e-30c7-4af8-a8b8-96fecbcbcbd0] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3624576, received_partitions=3085681
INFO 2023-02-17 23:02:20,909 [shard 0] stream_session - [Stream #3185a48b-fb9e-4190-88f4-5c7a386bc9bd] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3505024, received_partitions=3079345
INFO 2023-02-17 23:03:02,039 [shard 0] stream_session - [Stream #0d2964dc-d5e3-4775-825c-97f736d14713] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2808192, received_partitions=2655811
AFTER:
INFO 2023-02-17 23:12:49,155 [shard 0] stream_session - [Stream #bf00963c-3334-4035-b1a9-4b3ceb7a188a] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2965376, received_partitions=1006535
INFO 2023-02-17 23:13:13,365 [shard 0] stream_session - [Stream #1cd2e3ac-a68b-4cb5-8a06-707e91cf59db] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3543936, received_partitions=1406157
INFO 2023-02-17 23:13:37,474 [shard 0] stream_session - [Stream #5a278230-6b4b-461f-8396-c15df7092d03] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3639936, received_partitions=1371298
INFO 2023-02-17 23:14:02,132 [shard 0] stream_session - [Stream #19f40dc3-e02a-4321-a917-a6590d99dd03] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3638912, received_partitions=1435386
INFO 2023-02-17 23:14:26,673 [shard 0] stream_session - [Stream #d47507eb-2067-4e8f-a4f7-c82d5fbd4228] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3561600, received_partitions=1423024
INFO 2023-02-17 23:14:49,307 [shard 0] stream_session - [Stream #d42ee911-253a-4de6-ac89-6a3c05b88d66] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2382592, received_partitions=1452656
INFO 2023-02-17 23:15:10,067 [shard 0] stream_session - [Stream #1f78c1bf-8e20-41bd-95de-16de3fc5f86c] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2632320, received_partitions=1252298
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20230219191924.37070-1-raphaelsc@scylladb.com>
Changing default manager from 56090 to 5090
@amnonh please review
@annastuchlik please change if other locations in Docs require this change
Closes#12682
they are part of the CQL type system, and are "closer" to types,
so let's move them into the "types" directory.
the build systems are updated accordingly.
the source files referencing `types.hh` were updated using the following
command:
```
find . \( -name "*.cc" -o -name "*.hh" \) -exec sed -i 's/\"types.hh\"/\"types\/types.hh\"/' {} +
```
the source files under sstables include "types.hh", which is
indeed the one located under "sstables", so include "sstables/types.hh"
instead, to be more explicit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12926
because it has a member variable whose type is a reference, and a
reference cannot be reassigned. this silences the following warning from Clang:
```
/home/kefu/dev/scylladb/test/boost/chunked_vector_test.cc:152:27: error: explicitly defaulted move assignment operator is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
exception_safe_class& operator=(exception_safe_class&&) = default;
^
/home/kefu/dev/scylladb/test/boost/chunked_vector_test.cc:132:31: note: move assignment operator of 'exception_safe_class' is implicitly deleted because field '_esc' is of reference type 'exception_safety_checker &'
exception_safety_checker& _esc;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as partition_reversing_data_source_impl indirectly has a member
of reference type. this should address the following warning from
Clang:
```
/home/kefu/dev/scylladb/sstables/mx/partition_reversing_data_source.cc:476:43: error: explicitly defaulted move assignment operator is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
partition_reversing_data_source_impl& operator=(partition_reversing_data_source_impl&&) noexcept = default;
^
/home/kefu/dev/scylladb/sstables/mx/partition_reversing_data_source.cc:365:19: note: move assignment operator of 'partition_reversing_data_source_impl' is implicitly deleted because field '_schema' is of reference type 'const schema &'
const schema& _schema;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as auth headers reference cql3
```
In file included from /home/kefu/dev/scylladb/auth/authenticator.cc:16:
In file included from /home/kefu/dev/scylladb/cql3/query_processor.hh:24:
/home/kefu/dev/scylladb/lang/wasm_instance_cache.hh:20:10: fatal error: 'rust/cxx.h' file not found
^~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we include rust binding headers like `rust/wasmtime_bindings.hh`,
so they should be located in a directory named "rust".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as it references headers provided by wasmtime_bindings:
```
In file included from /home/kefu/dev/scylladb/cql3/functions/user_function.cc:9:
In file included from /home/kefu/dev/scylladb/cql3/functions/user_function.hh:16:
/home/kefu/dev/scylladb/lang/wasm.hh:14:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently the code will assert, because the cl pointer will be null, and
it will be null because there are no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>
In test_v2_apply_monotonically_is_monotonic_on_alloc_failures we
generate mutations with non-full continuity, so we should pass
is_evictable::yes to apply_monotonically(). Otherwise, it will assume
fully-continuous versions and not try to maintain continuity by
inserting sentinels.
This manifested in sporadic failures on continuity check.
Fixes#12882
this change is based on Botond Dénes's change which gave an overhaul
to the original CMake building system. this change is not enough
to build tests with CMake, as we still need to sort out the
dependencies.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it turns out Boost::regex references ICU::i18n, but it fails to
add the linkage to its public interface. so let's do this on its
behalf.
```
: && /home/kefu/.local/bin/clang++ -Wall -Werror -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-missing-braces -Wno-overloaded-virtual -Wno-parentheses-equality -Wno-unsupported-friend -march=westmere -O0 -g -gz CMakeFiles/scylla.dir/absl-flat_hash_map.cc.o CMakeFiles/$
ld.lld: error: undefined symbol: icu_67::Collator::createInstance(icu_67::Locale const&, UErrorCode&)
>>> referenced by icu.hpp:56 (/usr/include/boost/regex/icu.hpp:56)
>>> CMakeFiles/scylla.dir/utils/like_matcher.cc.o:(boost::re_detail_107500::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_67::Locale const&))
>>> referenced by icu.hpp:61 (/usr/include/boost/regex/icu.hpp:61)
>>> CMakeFiles/scylla.dir/utils/like_matcher.cc.o:(boost::re_detail_107500::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_67::Locale const&))
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
antlr3 generates code like `((foo == bar))`, but Clang does not
like it. let's disable this warning and explore other options later.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When we added zstd (f14e6e73bb), we used the static library
as we used some experimental APIs. However, now the dynamic
library works, so apparently the experimental API is now standard.
Switch to the dynamic library. It doesn't improve anything, but it
aligns with how we do things.
Closes#12902
`int_range::make_singular()` accepts a single `int` as its parameter,
so there is no need to brace the parameter with `{}`. this helps to silence
the following warning from Clang:
```
/home/kefu/dev/scylladb/test/perf/perf_fast_forward.cc:1396:63: error: braces around scalar initializer [-Werror,-Wbraced-scalar-init]
check_no_disk_reads(test(int_range::make_singular({100}))),
^~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12903
Fixes https://github.com/scylladb/scylla-docs/issues/4140
This PR adds a new Knowledge Base article about improved garbage collection in ICS. It's based on the document created by @raphaelsc https://docs.google.com/document/d/1fA7uBcN9tgxeHwCbWftPJz071dlhucoOYO1-KJeOG8I/edit?usp=sharing.
@raphaelsc Could you review it? I've made some improvements to the language and text organization, but I didn't add or remove any content, so it should be a quick review.
@tzach requested a diagram, but we can add it later. It would be great to have this content published asap.
Closes#12792
* github.com:scylladb/scylladb:
doc: add the new KB to the list of topics
doc: add a new KB article about tombstone garbage collection in ICS
Task ttl can be set with task manager test api, which is disabled
in release mode.
Move get_and_update_ttl from task manager test api to task manager
api, so that it can be used in release mode.
Closes#12894
The developer documentation in `building.md` suggested running unit tests with the `./tools/toolchain/dbuild test` command, however this command only invokes the `test` shell builtin, which immediately returns with status `1`:
```
[piotrs@new-host scylladb]$ ./tools/toolchain/dbuild test
[piotrs@new-host scylladb]$ echo $?
1
```
This was probably an unintended mistake, and what the author really meant was invoking `dbuild ninja test`.
Closes#12890
Said tests rely on being run with a limited amount of memory to be
really useful. When the memory amount is unexpected, they silently exit.
Which is exactly what they did in debug mode too, where the amount of
available memory cannot be controlled.
Disable the check in debug mode.
Said method can now throw `std::bad_alloc` since aab5954. All call-sites
should have been adapted in the series introducing the throw, but some
managed to slip through because the oom unit test didn't run in debug
mode. In this commit the remaining unpatched call-sites are fixed.
as `my_result_collector` has virtual functions and its dtor is not
marked virtual, Clang complains. let's mark its base class's dtor virtual
to be on the safe side.
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on non-final 'my_result_collector' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<my_result_collector>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/db/virtual_table.cc:134:25: note: in instantiation of member function 'std::unique_ptr<my_result_collector>::~unique_ptr' requested here
auto consumer = std::make_unique<my_result_collector>(s, permit, &pr, std::move(reader_and_handle.second));
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12879
The former is a convenience wrapper over the latter. There's no real benefit in using it, but having two test_env-s is worse than just one.
Closes#12794
* github.com:scylladb/scylladb:
sstable_utils: Move the test_setup to perf/
sstable_utils: Remove unused wrappers over test_env
sstable_test: Open-code do_with_cloned_tmp_directory
sstable_test: Asynchronize statistics_rewrite case
tests: Replace test_setup::do_with_tmp_directory with test_env::do_with(_async)?
This patch adds a reproducer for the bug described in issue #7964 -
The restriction `where k=1 and c=2` (when k,c are the key columns)
returns (at most) a single row so doesn't need ALLOW FILTERING,
but if we add a third restriction, say `v=2`, this still processes
at most a single row so doesn't need ALLOW FILTERING - and both
Scylla and Cassandra get it wrong - so it's marked with both xfail
and cassandra_bug.
The patch also adds another test that for longer partition slices,
e.g., `where k=1 and c>2`, although the slice itself doesn't need
filtering, if we add `v=2` here we suddenly do need ALLOW FILTERING,
because the slice itself may be a large number of rows, and adding
`v=2` may restrict it to just a few results. This test passes
on both Scylla and Cassandra.
Issue #7964 mentioned these scenarios and even had some example code,
but we never added it to the test suite, so we finally do it now.
Refs #7964
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12850
There are two methods to mess with compaction history -- update and get. The former had been patched to use the local system-keyspace instance by 907fd2d3 (system_keyspace: De-static compaction history update); now it's time for the latter (spoiler: it's only used by the API handler).
Closes#12889
* github.com:scylladb/scylladb:
system_keyspace; Make get_compaction_history non static and drop qctx
api, compaction_manager: Get compaction history via manager
system_keyspace: Move compaction_history_entry to namespace scope
Tests of each module that is integrated with task manager use
calls to task manager api. Boilerplate to call, check status, and
get result may be reduced using functions.
task_manager_utils.py contains wrappers for task manager api
calls and helpers that may be reused by different tests.
Closes#12844
* github.com:scylladb/scylladb:
test: use functions from task_manager_utils.py in test_task_manager.py
test: add task_manager_utils.py
as an abstract base class `output_writer` is inherited by both
`json_output_writer` and `text_output_writer`. and `output_manager`
manages the lifecycles of used writers using
`std::unique_ptr<output_writer>`.
before this change, the dtor of `output_writer` was not marked as
virtual, so when the object was destroyed through a base-class pointer,
only the base class's dtor got called. but the dtor of `json_output_writer`
is non-trivial, in the sense that this class aggregates a bunch of member
variables. if we don't invoke its dtor when destroying this object,
leakage is expected.
so, in this change, the dtor of `output_writer` is marked as virtual.
this makes the dtors of all of its derived classes virtual, and the right
dtor is always called.
test/perf is only designed for testing and is not used in production.
also, this feature was recently integrated into the scylla executable in
228ccdc1c7, so there is no need to backport this change.
this change should also silence the following warning from Clang 17:
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on 'output_writer' that is abstract but has non-virtual destructor [-Werror,-Wdelete-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<output_writer>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/stl_construct.h:88:15: note: in instantiation of member function 'std::unique_ptr<output_writer>::~unique_ptr' requested here
__location->~_Tp();
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12888
This patch adds a cql-pytest test for an old secondary-index bug
that was described three years ago in issue #5823. cql-pytest makes
it easy to run the same test against different versions of Scylla,
and it was used to check that the bug existed in Scylla 2.3.0 but
was gone by 2.3.5, and also not present in master or in 2021.1.
A bit about the bug itself:
A secondary index is useful for equality restrictions (a=2) but can't be
used for inequality restrictions (a>=2). In Scylla 3.2.0 we used to have a
bug that because the restriction a>=2 couldn't be used through the index,
it was ignored completely. This is of course a mistake.
Refs #5823
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12856
before this change, `seastar_memory_segment_store_backend`
was a class with virtual methods, but it did not have a virtual
dtor, while we use a unique_ptr<segment_store_backend> to
manage the lifecycle of an instance of its derived class.
to enable the compiler to call the right dtor, we should
mark the base class's dtor as virtual. this should address the
following warnings from Clang-17:
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on non-final 'logalloc::seastar_memory_segment_store_backend' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<logalloc::seastar_memory_segment_store_backend>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/utils/logalloc.cc:812:20: note: in instantiation of member function 'std::unique_ptr<logalloc::seastar_memory_segment_store_backend>::~unique_ptr' requested here
: _backend(std::make_unique<seastar_memory_segment_store_backend>())
^
```
and
```
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:100:2: error: delete called on 'logalloc::segment_store_backend' that is abstract but has non-virtual destructor [-Werror,-Wdelete-abstract-non-virtual-dtor]
delete __ptr;
^
/home/kefu/.local/bin/../lib/gcc/x86_64-pc-linux-gnu/13.0.1/../../../../include/c++/13.0.1/bits/unique_ptr.h:405:4: note: in instantiation of member function 'std::default_delete<logalloc::segment_store_backend>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/utils/logalloc.cc:811:5: note: in instantiation of member function 'std::unique_ptr<logalloc::segment_store_backend>::~unique_ptr' requested here
contiguous_memory_segment_store()
^
```
Fixes#12872
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12873
Since 5c0f9a8180 ("mutation_partition: Switch cache of
rows onto B-tree") it's no longer in use, except in some
performance test, so remove it.
Although scylla-gdb.py is sometimes used with older releases,
it's so outdated we can remove it from there too.
Closes#12868
we should never return a reference to a local variable,
so in this change, a reference to a static variable is returned
instead. this should address the following warning from Clang 17:
```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
return {};
^~
```
Fixes#12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12876
Now the call is done via the system_keyspace instance, so it can be
unmarked static and can use the local query processor instead of global
qctx.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the API handler directly calls a static method from the system
keyspace. Patching it to call the compaction manager instead will let the
latter use its on-board plugged system keyspace for that. If the system
keyspace is not plugged, it means early boot or late shutdown, not a
good time to get compaction history anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We currently configure only TimeoutStartSec, but it's probably not
enough to prevent the coredump timeout, since TimeoutStartSec is the
maximum waiting time for service startup, and there is another directive
to specify the maximum service running time (RuntimeMaxSec).
To fix the problem, we should specify RuntimeMaxSec and TimeoutSec (it
configures both TimeoutStartSec and TimeoutStopSec).
Fixes#5430
Closes#12757
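A hedged sketch of what the resulting unit settings could look like (the drop-in file name and the timeout values below are illustrative, not taken from the actual patch):

```ini
# illustrative drop-in, e.g. <service>.service.d/timeout.conf
[Service]
# TimeoutSec sets both TimeoutStartSec and TimeoutStopSec
TimeoutSec=300
# RuntimeMaxSec bounds the total running time of the service,
# not just its startup
RuntimeMaxSec=300
```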
- date: drop implicitly generated ctor
- date: use std::in_range() to check for invalid year
Closes#12878
* github.com:scylladb/scylladb:
date: use std::in_range() to check for invalid year
date: drop implicitly generated ctor
Commit 1365e2f13e (gms: feature_service: re-enable features on node
startup) re-enabled features on feature service very early, so that on
boot a node sees its "correct" features state before it starts loading
system tables and replaying commitlog.
However, checking features happens on all shards independently, so
re-enabling should also happen on all shards.
One observed problem is in extract_scylla_specific_keyspace_info(). This
helper is used when loading a non-system keyspace to read scylla-specific
keyspace options. The helper is called on all shards, and on all-but-zero
it evaluates the checked SCYLLA_KEYSPACES feature to false, leaving the
specific data empty. As a result, different shards have different views
of the keyspaces' configuration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12881
because its base class `writer_impl` has a member variable
`_validator`, which has its copy ctor deleted. let's just
drop the defaulted move ctor, as the compiler is not able to
generate one for us.
```
/home/kefu/dev/scylladb/sstables/mx/writer.cc:805:5: error: explicitly defaulted move constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
writer(writer&& o) = default;
^
/home/kefu/dev/scylladb/sstables/mx/writer.cc:528:16: note: move constructor of 'writer' is implicitly deleted because base class 'sstable_writer::writer_impl' has a deleted move constructor
class writer : public sstable_writer::writer_impl {
^
/home/kefu/dev/scylladb/sstables/writer_impl.hh:29:48: note: copy constructor of 'writer_impl' is implicitly deleted because field '_validator' has a deleted copy constructor
mutation_fragment_stream_validating_filter _validator;
^
/home/kefu/dev/scylladb/mutation/mutation_fragment_stream_validator.hh:188:5: note: 'mutation_fragment_stream_validating_filter' has been explicitly marked deleted here
mutation_fragment_stream_validating_filter(const mutation_fragment_stream_validating_filter&) = delete;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12877
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and `-Wno-unused-variable` from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There's the distributed_loader::populate_column_family() helper that manages sstables on their way towards the table on boot. The method naturally belongs to table_population_metadata -- the helper class that in fact prepares the ground for the method in question.
This PR moves the method into the metadata class and removes a whole lot of extra alias-references and private-field-exporting methods from it. It also keeps the start_subdir and populate_column_family logic close to each other and relaxes several excessive checks in them.
Closes#12847
* github.com:scylladb/scylladb:
distributed_loader: Rename table_population_metadata
distributed_loader: Dont check for directory presence twice
distributed_loader: Move populate calls into metadata.start()
distributed_loader: Remove local aliases and exporters
distributed_loader: Move populate_column_family() into population meta
It used to be just metadata, providing the meta for population; now it
does the population by itself, so rename it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both start_subdir() and populate_subdir() check that the directory
exists with an explicit file_exists() check. That's excessive: if the
directory wasn't there in the former call, the latter can learn this by
checking the _sstable_directories map.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This makes the metadata class export even shorter API, keeps the three
sub-directories scanned in one place and allows removing the zero-shard
assertion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
column_condition is an LWT-specific boolean expression construct, but
recent work allowed it to be re-expressed in terms of generic expressions.
This series completes the work and eliminates the column_condition classes
and source file. Furthermore, a statement's IF clause is represented as a
single expression, rather than a vector of per-column conditions.
Closes#12597
* github.com:scylladb/scylladb:
cql3: modification_statement: unwrap unnecessary boolean_factors() call
cql3: modification_statement: use single expression for conditions
cql3: modification_statement: fix lwt null equality rules mangling
cql3: broadcast tables: tighten checks on conditions
cql3: grammar: communicate LWT IF conditions to AST as a simple expression
cql3: column_condition: fold into modification_statement
cql3: column_condition: inline column_condition_applies_to into its only caller
cql3: column_condition: inline column_condition_collect_marker_specification into its only caller
cql3: column_condition: eliminate column_condition class
cql3: column_condition: move expression massaging to prepare()
cql3: grammar: make columnCondition production return an expression
cql3: grammar: eliminate duplication in LWT IF clause "IN (...)" vs "IN ?"
cql3: grammar: remove duplication around columnCondition scalar/collection variants
cql3: grammar: extract column references into a new production
cql3: column_condition: eliminate column_condition::raw
After the previous patch, all local alias references in
populate_column_family() are no longer required. Neither are the
exporting calls from the table_population_metadata class.
One non-obvious change is capturing 'this' instead of 'global_table' on
calls that are cross-shard. That's OK, table_population_metadata is not
sharded<> and is designed for cross-shard usage too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This ownership change also requires the auto& = *this alias and extra
specification of where to call reshard() and reshape() from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
alternator headers are exposed to the targets which link against it,
so let's expose them using `target_include_directories()`.
also, `alternator` uses the Seastar library and uses xxHash indirectly.
we should fix the latter by exposing the included header instead,
but for now, let's just link alternator directly to xxHash.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Scylla uses different build modes to customize the build for different
purposes. in this change, instead of having them in a python dictionary,
the customized settings are located in their own files and loaded
on demand. we don't support multi-config generators yet.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
for better readability, and to silence following warning
from Clang 17:
```
/home/kefu/dev/scylladb/utils/date.h:5965:25: error: result of comparison of constant 9223372036854775807 with expression of type 'int' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
Y <= static_cast<int64_t>(year::max())))
~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/utils/date.h:5964:57: error: result of comparison of constant -9223372036854775808 with expression of type 'int' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
if (!(static_cast<int64_t>(year::min()) <= Y &&
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as one of its member variables does not have a default constructor.
this silences the following warning from Clang-17:
```
/home/kefu/dev/scylladb/utils/date.h:708:5: error: explicitly defaulted default constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
year_month_weekday() = default;
^
/home/kefu/dev/scylladb/utils/date.h:705:27: note: default constructor of 'year_month_weekday' is implicitly deleted because field 'wdi_' has no default constructor
date::weekday_indexed wdi_;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After 5badf20c7a the applier fiber does not
stop after it gets an abort error from a state machine, which may trigger
an assertion because the previous batch is not applied. Fix it.
Fixes#12863
In add_entry_on_leader, after wait_for_memory_permit() resolves but before
the fiber continues to run, the node may stop being the leader and then
become a leader again, which makes the currently held units outdated.
Detect this case by checking the term after the preemption.
There was a vague comment about CI using
larger limits for shedding. This turned out
to be false, and the real reason for the
different limits is that Scylla handles the -m
command line option differently in
debug and release builds.
Debug builds use the default memory allocator,
and the value of the -m Scylla option
is given to each shard. In release builds
memory is evenly distributed between shards.
To accommodate this, we read the current
memory limit from Scylla metrics.
The helper class ScyllaMetrics was introduced to
handle metrics parsing logic. It can
potentially be reused for dealing with
metrics in other tests.
Currently, we use two vectors for static and regular column conditions,
each element referring to a single column. There's a comment that keeping
them separate makes things simpler, but in fact we always treat both
equally (except in one case where we look at just the regular columns
and check that no static column conditions exist).
Simplify by storing just a single expression, which can be a conjunction
of multiple column conditions.
add_condition() is renamed to analyze_condition(), since it no longer
adds to the vectors.
search_and_replace() needs to return std::nullopt when it doesn't match,
or it doesn't recurse properly. Currently it doesn't break anything
because we only call the function on a binary_operator, but soon it will.
Instead of passing a vector of boolean factors, pass a single expression
(a conjunction). This prepares the way for more complex expressions, but
no grammar changes are made here.
The expression is stored as optional, since we'll need a way to indicate
whether an IF clause was supplied or not. We could play games with
boolean_factors(), but it becomes too tricky.
The expressions are broken down back to boolean factors during prepare.
We'll later consolidate them too.
Move column_condition_prepare() and its helper function into
modification_statement, its only caller. The column_condition.{cc,hh}
now become empty, so remove them.
This eliminates the column_condition concept, which was just a
custom expression, in favor of generic expressions. It still
has custom properties due to LWT specialness, but those custom
properties are isolated in column_condition_prepare().
This two-liner can be trivially inlined with no loss of meaning. Indeed
it's less confusing, because "applies_to" became less meaningful once
we integrated the column_value component into the expression.
It's become a wrapper around expression, so peel it off. The
methods are converted to free functions, with the intent to later
inline them into their callers, as they are also mostly just
wrappers.
columnCondition duplicates the grammar for scalar relations and subscripted
collection relations. Eliminate the duplication by introducing a
subscriptExpr production, which encapsulates the differences.
It's now a thin wrapper around an expression, so peel the wrapper
and keep just the expression. A boolean expression is, after all,
a condition, and we'll make the condition statement-wide soon
rather than apply just to a column.
This patch adds a reproducer to a static-column index lookup bug
described in issue #12829: The restriction `where pk=0 and s=1 and c=3`
where pk,c are the primary key and s is an indexed static column,
results in an internal error: "clustering column id 2 >= 2".
Unfortunately, because on_internal_error() crashes Scylla in debug
mode, we need to mark this failing test with skip instead of xfail.
Refs #12829
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12852
This function is called on the fast data path
from storage_proxy when sorting multiple endpoints
by proximity.
This change calculates numeric node diff metrics
based on each address's proximity to a given node
(by <dc, rack, same node>) to eliminate logical
branches in the function and reduce its footprint.
Based on objdump -d output, the compare_endpoints
footprint was reduced by 58.5% (3632 / 8752 bytes)
with clang version 15.0.7 (Fedora 15.0.7-1.fc37).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a regression unit test for topology::compare_endpoints
before it's optimized in the following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes#12858
The test relies on an exact request size, which doesn't
work if compression is applied. The driver enables
compression only if both the server and the client
agree on the codec to use. If a compression package
(e.g. lz4) is not installed, compression
is not used.
The trick with locally_supported_compressions is needed
since I couldn't find any standard means to disable
compression other than the compression flag
on the cluster object, which seemed too broad.
Fixes#12836
In issue #12828 it was noted that Scylla requires ALLOW FILTERING
for `where b=1 and c=1` where b is an indexed static column and
c is a clustering key, and it was suggested that this is a bug.
This patch adds a test that confirms that both Scylla and Cassandra
require ALLOW FILTERING in this case. We explain in a comment that
this requirement is expected (i.e., it's not a bug), as the `b=1`
may match a huge number of rows, and the `c=1` may further match just
a few of those - i.e., it is filtering.
This test is virtually identical to the test we already had for
`where a=1 and c=1` - when `a` is an indexed regular column.
There too, the ALLOW FILTERING is required.
Closes#12828 as "not a bug".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12848
More efficient than retrieving the size from sstable_set::all(), which
may involve copying elements.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12835
this is the first step to re-enable cmake to build scylla, so we can experiment with C++20 modules and other changes before porting them to `configure.py`. please note, this changeset alone does not address all the issues yet. as this is a low-priority project, i want to do this in smaller (or tiny!) steps.
* build: cmake: s/Abseil/absl/
* build: cmake: sync with source files compiled in configure.py
* build: cmake: do not generate crc_combine_table at build time
* build: cmake: use packaged libdeflate
Closes#12838
* github.com:scylladb/scylladb:
build: cmake: add rust binding
build: cmake: extract cql3 and alternator out
build: cmake: use packaged libdeflate
build: cmake: do not generate crc_combine_table at build time
build: cmake: sync with source files compiled in configure.py
build: cmake: s/Abseil/absl/
Task manager api will be used in many tests. Thus, to make it easier,
api calls to task manager are wrapped into functions in task_manager_utils.py.
Some helpers that may be reused in other tests are moved there too.
as it has a reference-type member variable, and Clang 17 warns
on seeing this:
```
/home/kefu/dev/scylladb/row_cache.hh:359:16: warning: explicitly defaulted move assignment operator is implicitly deleted [-Wdefaulted-function-deleted]
row_cache& operator=(row_cache&&) = default;
^
/home/kefu/dev/scylladb/row_cache.hh:214:20: note: move assignment operator of 'row_cache' is implicitly deleted because field '_tracker' is of reference type 'cache_tracker &'
cache_tracker& _tracker;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as one of the (indirect) member variables has a user-declared move
ctor, which prevents the compiler from generating the default copy ctor
or assignment operator for the classes containing `timer`.
```
/home/kefu/dev/scylladb/utils/histogram.hh:440:5: warning: explicitly defaulted copy constructor is implicitly deleted [-Wdefaulted-function-deleted]
timed_rate_moving_average_and_histogram(const timed_rate_moving_average_and_histogram&) = default;
^
/home/kefu/dev/scylladb/utils/histogram.hh:437:31: note: copy constructor of 'timed_rate_moving_average_and_histogram' is implicitly deleted because field 'met' has a deleted copy constructor
timed_rate_moving_average met;
^
/home/kefu/dev/scylladb/utils/histogram.hh:298:17: note: copy constructor of 'timed_rate_moving_average' is implicitly deleted because field '_timer' has a deleted copy constructor
meter_timer _timer;
^
/home/kefu/dev/scylladb/utils/histogram.hh:212:13: note: copy constructor of 'meter_timer' is implicitly deleted because field '_timer' has a deleted copy constructor
timer<> _timer;
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/timer.hh:111:5: note: copy constructor is implicitly deleted because 'timer<>' has a user-declared move constructor
timer(timer&& t) noexcept : _sg(t._sg), _callback(std::move(t._callback)), _expiry(std::move(t._expiry)), _period(std::move(t._period)),
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as `range_tombstone_list::reverter` has a member variable of type
`const schema& _s`, which cannot be rebound, so the class is not
allowed to have a defaulted assignment operator.
this change should address the warning from Clang 17:
```
/home/kefu/dev/scylladb/range_tombstone_list.hh:122:19: warning: explicitly defaulted move assignment operator is implicitly deleted [-Wdefaulted-function-deleted]
reverter& operator=(reverter&&) = default;
^
/home/kefu/dev/scylladb/range_tombstone_list.hh:111:23: note: move assignment operator of 'reverter' is implicitly deleted because field '_s' is of reference type 'const schema &'
const schema& _s;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as one of the (indirect) member variables of `query::result` is not
copyable, the compiler refuses to create a copy ctor or an assignment
operator for us, and Clang 17 warns on seeing this.
so let's just drop them for better readability and, more importantly,
to preserve correctness.
```
/home/kefu/dev/scylladb/query-result.hh:385:5: warning: explicitly defaulted copy constructor is implicitly deleted [-Wdefaulted-function-deleted]
result(const result&) = default;
^
/home/kefu/dev/scylladb/query-result.hh:321:34: note: copy constructor of 'result' is implicitly deleted because field '_memory_tracker' has a deleted copy constructor
query::result_memory_tracker _memory_tracker;
^
/home/kefu/dev/scylladb/query-result.hh:97:23: note: copy constructor of 'result_memory_tracker' is implicitly deleted because field '_units' has a deleted copy constructor
semaphore_units<> _units;
^
/home/kefu/dev/scylladb/seastar/include/seastar/core/semaphore.hh:500:5: note: 'semaphore_units' has been explicitly marked deleted here
semaphore_units(const semaphore_units&) = delete;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 943c09f869...9b6e181e42 (34):
> semaphore: disallow move after used
> Revert "semaphore: assert no outstanding units when moved"
> reactor, tests: drop unused include
> spawn_test: prolong termination time to be more tolerant.
> net: s/offload_info()/get_offload_info()/
> Merge 'Extend http client with keep-alive connections' from Pavel Emelyanov
> util/gcc6-concepts.hh: drop gcc6-concepts.hh
> treewide: do not inline tls variables in shared library
> reactor: Remove --num-io-queues option
> build: correct the comment
> smp: do not inline function when BUILD_SHARED_LIBS
> iostream: always flush _fd in do_flush
> thread_pool: prevent missed wakeup when the reactor goes to sleep in parallel with a syscall completion
> Merge 'build: do not always build seastar as a static library' from Kefu Chai
> Revert "Merge 'Keep outgoing queue all cancellable while negotiating' from Pavel Emelyanov"
> Merge 'Keep outgoing queue all cancellable while negotiating' from Pavel Emelyanov
> memcached: prolong expiration time to be more tolerant
> treewide: add non-seastar "#include"s
> Merge 'Allow multiple abort requests' from Aleksandra Martyniuk
> app-template: remove duplicated includes
> include/seastar: s/SEASTAR_NODISCARD/[[nodiscard]]/
> prometheus: Don't report labels that starts with __
> memory: do not define variable only for assert
> reactor: set_shard_field_width() after resource::allocate()
> Merge 'reactor, core/resource: clean ups' from Kefu Chai
> util/concepts: include <concepts>
> build: use target_link_options() to pass options to linker
> iostream: add doxygen comment for eof()
> Merge 'util/print_safe, reactor: use concept for type constraints and refactory ' from Kefu Chai
> Right align the memory diagnostics
> Merge 'Add an API for the metrics layer to manipulate metrics dynamically.' from Amnon Heiman
> semaphore: assert no outstanding units when moved
> build: do not populate package registry by default
> build: stop detecting concepts support
Closes#12827
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
to replace tabs with spaces, for better readability if the editor
fails to render tabs with the right tabstop setting.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12839
Patch 55a8421e3d fixed an inefficiency when rebuilding
statistics with many compaction groups, but it incorrectly removed
the stats update for newly added SSTables. This patch restores it.
When a new SSTable is added to any of the groups, the stats are
incrementally updated (as before). On compaction completion,
statistics are still rebuilt by simply iterating through each
group, which keeps track of its own stats.
Unit tests are added to guarantee the stats are correct both after
compaction completion and memtable flush.
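The two update paths can be sketched as follows (hypothetical structures, not the actual compaction-group code): incremental update when an SSTable is added to a group, and a full rebuild that just sums the per-group stats, since each group tracks its own:

```cpp
#include <cstdint>
#include <vector>

// Each compaction group keeps track of its own stats.
struct group_stats {
    uint64_t bytes = 0;
    uint64_t count = 0;
};

struct table_stats {
    uint64_t bytes = 0;
    uint64_t count = 0;

    // On new sstable: incremental update (the path restored by this patch).
    void on_sstable_added(group_stats& g, uint64_t sst_bytes) {
        g.bytes += sst_bytes;
        g.count += 1;
        bytes += sst_bytes;
        count += 1;
    }

    // On compaction completion: rebuild by iterating through the groups.
    void rebuild(const std::vector<group_stats>& groups) {
        bytes = 0;
        count = 0;
        for (const auto& g : groups) {
            bytes += g.bytes;
            count += g.count;
        }
    }
};
```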
Fixes#12808.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12834
When rust wasmtime bindings were added, we committed Cargo.lock to
make sure a given version of Scylla always builds using the same
versions of rust dependencies. Therefore, it should not be present
in .gitignore.
Closes#12831
Wasmtime added some improvements in recent releases - particularly,
two security issues were patched in version 2.0.2. There were no
breaking changes for our use other than the strategy of returning
Traps - all of them are now anyhow::Errors instead, but we can
still downcast to them, and read the corresponding error message.
The cxx, anyhow and futures dependency versions now match the
versions saved in the Cargo.lock.
Closes#12830
The test test_scan.py::test_scan_long_partition_tombstone_string
checks that a full-table Scan operation ends a page in the middle of
a very long string of partition tombstones, and does NOT scan the
entire table in one page (if we did that, getting a single page could
take an unbounded amount of time).
The test is currently flaky, having failed in CI runs three times in
the past two months.
The reason for the flakiness is that we don't know exactly how long
we need to make the sequence of partition tombstones in the test before
we can be absolutely sure a single page will not read this entire sequence.
For single-partition scans we have the "query_tombstone_page_limit"
configuration parameter, which tells us exactly how long we need to
make the sequence of row tombstones. But for a full-table scan of
partition tombstones, the situation is more complicated - because the
scan is done on several vnodes in parallel and each of
them needs to read query_tombstone_page_limit before it stops.
In my experiments, using query_tombstone_page_limit * 4 consecutive tombstones
was always enough - I ran this test hundreds of times and it didn't fail
once. But since it did fail on Jenkins very rarely (3 times in the last
two months), maybe the multiplier 4 isn't enough. So this patch doubles
it to 8. Hopefully this would be enough for anyone (TM).
This makes this test even bigger and slower than it was. To make it
faster, I changed this test's write isolation mode from the default
always_use_lwt to forbid_rmw (not use LWT). This leaves the test's
total run time to be similar to what it was before this patch - around
0.5 seconds in dev build mode on my laptop.
Fixes#12817
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12819
these source files are out of sync with the source files listed
in `configure.py`. some of them were removed, some of them were
added. let's try to keep them in sync. this paves the road to a
working CMakeLists.txt
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
find abseil library with the name of absl, instead of "Abseil".
absl's cmake config file is provided with the name of
`abslConfig.cmake`, not `AbseilConfig.cmake`.
see also cde2f0eaae/CMakeLists.txt (L198).
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, because serialize_visitor::operator() is not implemented for counters, we cannot convert a counter returned by a WASM UDF to bytes when returning from wasm::run_script().
We could disallow using counters as WASM UDF return types, but an easier solution which we're already using in Lua UDFs is treating the returned counters as 64-bit integers when deserializing. This patch implements the latter approach and adds a test for it.
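The idea can be sketched as follows (hypothetical helper, not the actual serialization code): a counter's accumulated value is wire-compatible with a signed 64-bit big-endian integer, so deserializing it as one is enough:

```cpp
#include <array>
#include <cstdint>

// Sketch: read a counter's final value as a big-endian signed 64-bit
// integer, the same way the Lua UDF path already treats counters.
int64_t deserialize_counter(const std::array<uint8_t, 8>& buf) {
    uint64_t v = 0;
    for (auto b : buf) {
        v = (v << 8) | b;  // accumulate big-endian bytes
    }
    return static_cast<int64_t>(v);
}
```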
Closes#12806
* github.com:scylladb/scylladb:
wasm udf: deserialize counters as integers
test_wasm.py: add utility function for reading WASM UDF saved in files
This patch is based on #12681; only the last 3 commits are relevant.
As described in #12709, currently, when a UDF used in a UDA is replaced, the UDA is not updated until the whole node is restarted.
This patch fixes the issue by updating all affected UDAs when a UDF is replaced.
Additionally, it includes a few convenience changes
Closes#12710
* github.com:scylladb/scylladb:
uda: change the UDF used in a UDA if it's replaced
functions: add helper same_signature method
uda: return aggregate functions as shared pointers
Currently, effective_replication_map::do_get_ranges accepts
a functor that traverses the natural endpoints of each token
to decide whether a token range should be returned or not.
This is done by copying the natural endpoints vector for
each token. However, other than special strategies like
everywhere and local, the functor can be called on the
precalculated inet_address_vector_replica_set in the
replication_map and there's no need to copy it for each call.
for_each_natural_endpoint_until passes a reference to the function
down to the abstract replication strategy to let it work either
on the precalculated inet_address_vector_replica_set or
on an ad-hoc vector prepared by the replication strategy.
The function returns stop_iteration::yes when a match or mismatch
is found, or stop_iteration::no while it has no definite result.
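The traversal described above can be sketched like this (simplified types; `endpoint` stands in for the real inet_address, and the real code dispatches through the replication strategy):

```cpp
#include <functional>
#include <vector>

enum class stop_iteration { no, yes };

using endpoint = int;  // stand-in for inet_address

// Pass the predicate down so the strategy can run it directly over its
// precalculated replica set, without copying a vector for every token.
stop_iteration for_each_endpoint_until(
        const std::vector<endpoint>& replicas,
        const std::function<stop_iteration(const endpoint&)>& f) {
    for (const auto& ep : replicas) {
        if (f(ep) == stop_iteration::yes) {
            return stop_iteration::yes;  // definite result found
        }
    }
    return stop_iteration::no;  // no definite result for this replica set
}
```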
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12737
LWT `IF` (column_condition) duplicates the expression prepare and evaluation code. Annoyingly,
LWT IF semantics are a little different than the rest of CQL: a NULL equals NULL, whereas usually
NULL = NULL evaluates to NULL.
This series converts `IF` prepare and evaluate to use the standard expression code. We employ
expression rewriting to adjust for the slightly different semantics.
In a few places, we adjust LWT semantics to harmonize them with the rest of CQL. These are pointed
out in their own separate patches so the changes don't get lost in the flood.
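The semantic difference can be sketched with `std::optional` standing in for nullable CQL values (illustrative only, not the actual evaluator):

```cpp
#include <optional>
#include <string>

using value = std::optional<std::string>;  // nullopt models a CQL NULL
using bool3 = std::optional<bool>;         // nullopt models "the result is NULL"

// Regular CQL/SQL semantics: NULL = NULL evaluates to NULL.
bool3 sql_eq(const value& a, const value& b) {
    if (!a || !b) {
        return std::nullopt;
    }
    return *a == *b;
}

// LWT IF semantics: NULL equals NULL, and the result is always a real bool.
bool lwt_eq(const value& a, const value& b) {
    return a == b;  // std::optional treats nullopt == nullopt as true
}
```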
Closes#12356
* github.com:scylladb/scylladb:
cql3: lwt: move IF clause expression construction to grammar
cql3: column_condition: evaluate column_condition as a single expression
cql3: lwt: allow negative list indexes in IF clause
cql3: lwt: do not short-circuit col[NULL] in IF clause
cql3: column_condition: convert _column to an expression
cql3: expr: generalize evaluation of subscript expressions
cql3: expr: introduce adjust_for_collection_as_maps()
cql3: update_parameters: use evaluation_inputs compatible row prefetch
cql3: expr: protect extract_column_value() from partial clustering keys
cql3: expr: extract extract_column_value() from evaluation machinery
cql3: selection: introduce selection_from_partition_slice
cql3: expr: move check for ordering on duration types from restrictions to prepare
cql3: expr: remove restrictions oper_is_slice() in favor of expr::is_slice()
cql3: column_condition: optimize LIKE with constant pattern after preparing
cql3: expr: add optimizer for LIKE with constant pattern
test: lib: add helper to evaluate an expression with bind variables but no table
cql3: column_condition: make the left-hand-side part of column_condition::raw
cql3: lwt: relax constraints on map subscripts and LIKE patterns
cql3: expr: fix search_and_replace() for subscripts
cql3: expr: fix function evaluation with NULL inputs
cql3: expr: add LWT IF clause variants of binary operators
cql3: expr: change evaluate_binop_sides to return more NULL information
In issue #12601, a dtest involving paging of ListStreams showed
incorrect results - the paged results had one duplicate stream and one
missing stream. We believe that the cause of this bug was that the
unsorted map of tables can change order between pages. In this patch
we add a test test_list_streams_paged_with_new_table which can
demonstrate this bug - by adding a lot of tables in mid-paging, we
cause the unsorted map to be reshuffled and the paging to break.
This is not the same situation as in #12601 (which did not involve
new tables) but we believe it demonstrates the same bug - and check
its fix. Indeed this passes with the fix in pull request #12614 and
fails without it.
This patch also adds a second test, test_stream_arn_unchanging:
That test eliminates a guess we had for the cause of #12601. We
thought that maybe the stream ARN changes on a table if its schema
version changes, but the new test confirms that it actually behaves
as expected (the stream ARN doesn't change).
Refs #12601
Refs #12614
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12616
This patch adds yet another reproducer for issue #10649, where
the combination of filtering and LIMIT returns fewer results when
a secondary index is added to the table.
Whereas the previous tests we had for this issue involved a regular
(global) index, the new test uses a local index (a Scylla-only feature).
It shows that the same bug exists also for local indexes, as noticed
by a user in #12766.
Refs #10649
Refs #12766
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12783
ScyllaDB has a wide variety of tools and sources of information useful for
diagnosing problems. These are scattered all over the place and although
most of these are documented, there is currently no document listing all
the relevant tools and information sources when it comes to diagnosing a
problem.
This patch adds just that: a document listing the different tools and
information sources, with a brief description of how they can help in
diagnosing problems, and a link to the relevant dedicated documentation
pages.
Closes#12503
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with a large number of concurrent tests running, they might time out.
The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.
Closes#12765
* github.com:scylladb/scylladb:
test/pylib: use larger timeout for decommission/removenode
test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
now that these variables are set, let's reuse them when appropriate.
less repetition this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12802
Issue #7659, which we solved long ago, was about a query which included
a non-EQ restriction and wrongly picked up one of the indexes. It had
a short C++ regression test, but here we add a more elaborate Python
test for the same bug. The advantages of the Python test are:
1. The Python test can be run against any version of Scylla (e.g., to
check whether a certain version contains a backport of the fix).
2. The Python test reproduces not only a "benign" query error, but also
an assertion-failed crash which happened when the non-EQ restriction
was an "IN".
3. The Python test reproduces the same bug not just for a regular
index, but also a local index.
I checked that, as expected, these tests pass on master, but fail
(and crash Scylla) in old branches before the fix for #7659.
Refs #7659.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12797
It's only used by a single test and apparently exists since the times
seastar was missing the future::discard_result() sugar
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12803
Currently, because serialize_visitor::operator() is not implemented
for counters, we cannot convert a counter returned by a WASM UDF
to bytes when returning from wasm::run_script().
We could disallow using counters as WASM UDF return types, but an
easier solution which we're already using in Lua UDFs is treating
the returned counters as 64-bit integers when deserializing. This
patch implements the latter approach and adds a test for it.
Currently, we're repeating the same os.path, open, read, replace
each time we read a WASM UDF from a file.
To reduce code bloat, this patch adds a utility function
"read_function_from_file" that finds the file and reads it given
a function name and an optional new name, for cases when we want
to use a different name in cql (mostly for unique_names).
Move long running topology tests out of `test_topology.py` and into their own files, so they can be run in parallel.
While there, merge simple schema tests.
Closes#12804
* github.com:scylladb/scylladb:
test/topology: rename topology test file
test/topology: lint and type for topology tests
test/topology: move topology ip tests to own file
test/topology: move topology test remove garbaje...
test/topology: move topology rejoin test to own file
test/topology: merge topology schema tests and...
test/topology: isolate topology smp params test
test/topology: move topology helpers to common file
Instead of the grammar passing expression bits to column_condition,
have the grammar construct an unprepared expression and pass it as
a whole. column_condition::raw then uses prepare_expression() to
prepare it.
The call to validate_operation_on_durations() is eliminated, since it's
already done by prepare_expression().
Some tests adjusted for slightly different wording.
Instead of laboriously hand-evaluating each expression component,
construct one expression for the entire column_condition during
prepare time, and evaluate it using the generic machinery.
LWT IF evaluates equality against NULL considering two NULLs
as equal. We handle that by rewriting such expressions to use
null_handling_style::lwt_nulls.
Note we use expr::evaluate() rather than is_satisfied_by(), since
the latter doesn't like functions on the top-level, which we have
due to LIKE with constant pattern optimization. evaluate() is more
generic anyway.
LWT IF clause errors out on negative list indexes. This deviates
from non-LWT subscript evaluation, from PostgreSQL, and from the
too-large-index case, all of which evaluate the subscript to NULL.
Make things more consistent by also evaluating list[-1] to NULL.
A test is adjusted.
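The adjusted subscript semantics can be sketched as follows (hypothetical helper, simplified types): a NULL, negative, or too-large index all evaluate the subscript to NULL rather than raising an error:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using value = std::optional<std::string>;  // nullopt models a CQL NULL

// col[NULL], col[-1] and col[too_large] all evaluate to NULL.
value list_subscript(const std::vector<std::string>& list,
                     std::optional<int64_t> idx) {
    if (!idx || *idx < 0 || *idx >= static_cast<int64_t>(list.size())) {
        return std::nullopt;
    }
    return list[static_cast<std::size_t>(*idx)];
}
```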
Currently if an LWT IF clause contains a subscript with NULL
as the key, then the entire IF clause is evaluated as FALSE.
This is incorrect, because col[NULL] = NULL would simplify
to NULL = NULL, which is interpreted as TRUE using the LWT
comparisons. Even with SQL NULL handling, "col[NULL] IS NULL"
should evaluate to true, but since we short-circuit as soon
as we encounter the NULL key, we cannot complete the evaluation.
Fix by setting cell_value to null instead of returning immediately.
Tests that check for this were adjusted. Since the test changed
behavior from not applying the statement to applying it, a new
statement is added that undoes the previous one, so downstream
statements are not affected.
After this change, all components of column_condition are expressions.
One LWT-specific hack was removed from the evaluation path:
- lists being represented as maps is made transparent by
converting during evaluation with adjust_for_collections_as_maps()
column_condition::applies_to() previously handled a missing row
by materializing a NULL for the column being evaluated; now it
materializes a NULL row instead, since evaluation of the column is
moved to common code.
A few more cases in lwt_test became legal, though I'm not sure
exactly why in this patch.
Currently, evaluation of a subscript expression x[y] requires that
x be a column_value, but that's completely artificial. Generalize
it to allow any expression.
This is needed after we transform a LWT IF condition from
"a[x] = y" to "func(a)[x] = y", where func casts a from a
map represention of a list back to a list; but it's also generally
useful.
LWT and some list operations represent lists using a form like
their mutations, so that the mutation list keys can be recovered
and used to update the list. But the evaluation machinery knows
nothing about that, and will return the map-form even though the type
system thinks it is a list.
To handle that, add a utility to rewrite the expression so
that the value is re-serialized into the expected list form. The
rewrite is implemented as a scalar function taking the map form and
returning the list form.
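The rewrite can be sketched as follows (hypothetical helper; the real code operates on serialized values and the map key is a timeuuid cell key, not a string):

```cpp
#include <map>
#include <string>
#include <vector>

// Drop the mutation-form keys to recover the list form the type system
// expects. The map is ordered by key, so list order is preserved.
std::vector<std::string> list_from_map_form(
        const std::map<std::string, std::string>& m) {
    std::vector<std::string> out;
    out.reserve(m.size());
    for (const auto& [key, val] : m) {
        out.push_back(val);
    }
    return out;
}
```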
update_parameters::prefetch_data is used for some list updates (which
need a read-before-write to determine the key to update) and for
LWT compare-and-swap. Currently they use a custom structure for
representing a read row.
Switch to the same structure that is used in evaluation_inputs (and
in SELECT statement evaluation) so the expression machinery can be reused.
The expression representation is irregular (with different fields for
the keys and regular/static columns), so we introduce an old_row
structure to hold both the clustering key and the regular row values
for cas_request.
A nice bonus is that we can use get_non_pk_values() to read the data
into the format expected by evaluation_inputs, but on the other hand
we have to adjust get_prefetched_list() to fix up the type of
the returned list (we return it as a map, not a list, so list updates
can access the index).
Partial clustering keys can exist in COMPACT STORAGE tables (though they
are exceedingly rare), and when LWT materializes a static row. Harden
extract_column_value() so it is ready for them.
Expression evaluation works with the evaluation_input structure to
compute values. As we move LWT column_condition towards expressions,
we'll start using evaluation_input, so provide this helper to ease
the transition.
Since expressions were introduced for SELECT statements, they
work with `selection` object to represent which table columns
they can work with. Probably a neutral representation would have
been better, but that's what we have now.
LWT works with partition_slice, so introduce a
selection_from_partition_slice() helper to bridge the two worlds.
Both LWT IF clause and SELECT WHERE clause check that a duration type
isn't used in an ordered comparison, since duration types are unordered
(is 1mo more or less than 30d?). As a first step towards centralizing this
check, move the check from restrictions into prepare. When LWT starts using
prepare, the duplication will be removed.
The error message was changed: the word "slice" is an internal term, and
a comparison does not necessarily have to be in a restriction (which is
also an internal term).
Tests were adjusted.
This just moves things around to put all the code we will kill in
one place.
Note the code was adjusted: before the move, it operated on
an unprepared untyped_constant; after the move it operates on
a prepared constant.
Compiling a pattern is expensive and so we should try to do it
at prepare time, if the pattern is a constant. Add an optimizer
that looks for such cases and replaces them with a unary function
that embeds the compiled pattern.
This isn't integrated yet with prepare_expr(), since the filtering
code isn't ready for generic expressions. Its first user will be LWT,
which contains the optimization already (filtering had it as well,
but lost it sometime during the expression rewrite).
A unit test is added.
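The optimization can be sketched by translating the pattern to a regex once, at prepare time, rather than per evaluated row (hypothetical helper, not Scylla's actual LIKE engine):

```cpp
#include <regex>
#include <string>

// Translate a CQL LIKE pattern ('%' = any run, '_' = any single
// character) into a regex, escaping regex metacharacters.
std::regex compile_like(const std::string& pattern) {
    std::string re;
    for (char c : pattern) {
        switch (c) {
        case '%': re += ".*"; break;
        case '_': re += "."; break;
        default:
            if (std::string("\\^$.|?*+()[]{}").find(c) != std::string::npos) {
                re += '\\';  // escape regex metacharacters
            }
            re += c;
        }
    }
    return std::regex(re);
}

// At evaluation time, only the (cheap) match runs; compilation is done.
bool like_match(const std::regex& compiled, const std::string& s) {
    return std::regex_match(s, compiled);
}
```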
Sometimes we want to defeat the expression optimizer's ability to
fold constant expressions. A bind variable is a convenient way to
do this, without the complexity of faking a schema and row inputs.
Add a helper to evaluate an expression with bind variable parameters,
doing all the paperwork for us.
A companion make_bind_variable() is added to likewise simplify
creating bind variables for tests.
LWT IF conditions are collected with the left-hand-side outside the
condition structure, then moved back to the prepared condition
structure during preparation. Change that so that the raw description
also contains the left-hand-side. This makes it more similar to expressions
(which LWT conditions aspire to be).
The change is mechanical; a bit of code that used to manage the std::pair
is moved to column_condition::raw::prepare instead. The schema is now also
passed since it's needed to prepare the left-hand-side.
Previously, we rejected map subscripts that are NULL, as well as
LIKE patterns that are NULL. General SQL expression evaluation
allows NULL everywhere, and doesn't raise errors - an expression
involving NULL generally yields NULL. Change the behavior to
follow that. Since the new behavior was previously disallowed,
no one should have been relying on it and there is no compatibility
problem.
Update the tests and note it as a CQL extension.
Function call evaluation rejects NULL inputs, unnecessarily. Functions
work well with NULL inputs. Fix by relaxing the check.
This currently has no impact because functions are not evaluated via
expressions, but via selectors.
LWT IF clause interprets equality differently from SQL (and the
rest of CQL): it thinks NULL equals NULL. Currently, it implements
binary operators all by itself so the fact that oper_t::EQ (and
friends) means something else in the rest of the code doesn't
bother it. However, we can't unify the code (in
column_condition.cc) with the rest of expression evaluation if
the meaning changes in different places.
To prepare for this, introduce a null_handling_style field to
binary_operator that defaults to `sql` but can be changed to
`lwt_nulls` to indicate this special semantic.
A few unit tests are added. LWT itself still isn't modified.
group0 members to own file
Move the slow tests for removenode with nodes not present in group0,
and for a server after a sudden stop, to a separate file.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.
Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.
Also fix a comment in `cdc_with_lwt_test`.
Fixes#12098
Closes#12768
* github.com:scylladb/scylladb:
test/cql-pytest: test_cdc: regression test for #12098
test/cql: cdc_with_lwt_test: fix comment
service: storage_proxy: sequence CDC preimage select with Paxos learn
... move them to their own file.
Schema verification tests for restart, add, and hard stop of server can
be done with the same cluster. Merge them in the same test case.
While there, move them to a separate file to be run independently as
this is a slow test.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Improve logging by printing the cluster at the end of each test.
Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.
Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.
Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.
Closes#12652
* github.com:scylladb/scylladb:
test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
test/topology: don't drop random_tables keyspace after a failed test
test/pylib: mark cluster as dirty after a failed test
test: pylib, topology: don't perform operations after test on a dirty cluster
test/pylib: print cluster at the end of test
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with a large number of concurrent tests running, they might time out.
The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.
Force snapshot with schema changes while server down. Then verify schema when bringing back up the server.
Closes#12726
* github.com:scylladb/scylladb:
pytest/topology: check snapshot transfer
raft conf error injection for snapshot
test/pylib: one-shot error injection helper
Perform multiple LWT inserts to different keys ensuring none of them
observes a preimage.
On my machine this test reproduces the problem more than 50% of the time
in debug mode.
Currently, evaluate_binop_sides() returns std::nullopt if either
side is NULL.
Since we wish to add binary operators that do consider NULL on
each side, make evaluate_binop_sides return the original NULLs
instead (as managed_bytes_opt).
Ultimately I think evaluate_binop_sides() should disappear, but before
that we have to improve unset value checking.
The helping wrapper facilitates the usage of sharded<sstable_directory> for several test cases, and the helper and its callers deserved some cleanup over time.
Closes#12791
* github.com:scylladb/scylladb:
sstable_directory_test: Reindent and de-multiline
sstable_directory_test: Enlighten and rename sstable_from_existing_file
sstable_directory_test: Remove constant parallelizm parameter
Merging rows from different partition versions should preserve the LRU link of
the entry from the newer version. We need this in case we're merging two last
dummy entries where the older dummy is already unlinked from the LRU. The
newer dummy could be the last entry which is still holding the partition
entry linked in the LRU.
The mutation_partition_v2 merging didn't take the LRU link from the newer
entry, and we could end up with the partition entry not having any entries
linked in the LRU.
Introduced in f73e2c992f.
Fixes#12778
Closes#12785
System keyspace is a keyspace with local replication strategy and thus
it does not need to be repaired. It is possible to invoke repair
of this keyspace through the api, which leads to runtime error since
peer_events and scylla_table_schema_history have different sharding logic.
For keyspaces with local replication strategy repair_service::do_repair_start
returns immediately.
Closes#12459
* github.com:scylladb/scylladb:
test: rest_api: check if repair of system keyspace returns before corresponding task is created
repair: finish repair immediately on local keyspaces
The sstable perf test uses test_setup ability to create temporary
directory and clean it and that's the only place that uses it. Move the
remainders of test_setup into perf/ so that no unit tests attempt to
re-use it (there's test_env for that).
Remove unused _walker and _create_directory while at it.
Mark protected stuff private while at it as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The statistics_rewrite case uses the helper that creates a copy of the
provided static directory, but it's the only user of this helper. It's
better to open-code it into the test case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is run inside an async context and can be coded in a shorter form
without using deeply nested then-s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The former helper is just a wrapper over the _async version of the
latter and also creates a tempdir and calls the fn with tempdir as an
argument. The test_env already has its own temp dir on board, so callers
can be switched to using it.
Some test cases use the do_with_tmp_directory but generate chain of
futures without in fact using the async context. This patch addresses
that, so the change is not 100% mechanical unfortunately.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Many tests using sstable directory wrapper have broken indentation with
previous patching. Fix it. No functional changes.
Also, while at it, convert multiline wrapper calls into one-line, after
previous patch these are short enough for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It used to be the sstable maker for sstable::test_env / cql_test_env;
now sstables for tests are made via the sstables manager explicitly, so it
can be renamed to something more relevant to its current status.
Also, make its constructors non-explicit so call sites look shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reverts commit e7d5e508bc. It ends up
failing continuous integration tests randomly. We don't know if it's
uncovering an existing bug, or if RBNO itself is broken, but for now we
need to revert it to unblock progress.
Now it's scattered between the distributed loader and the sstable directory code, making the latter quite bloated. Keeping everything in the distributed loader makes the sstable_directory code compact and easier to patch to support an object storage backend.
Closes #12771
* github.com:scylladb/scylladb:
sstable_directory: Rename remove_input_sstables_from_reshaping()
sstable_directory: Make use of remove_sstables() helper
sstable_directory: Merge output sstables collecting methods
distributed_loader: Remove max_compaction_threshold argument from reshard()
distributed_loader: Remove compaction_manager& argument from reshard()
sstable_directory: Move the .reshard() to distributed_loader
sstable_directory: Add helper to load foreign sstable
sstable_directory: Add io-prio argument to .reshard()
sstable_directory: Move reshard() to distributed_loader.cc
distributed_loader: Remove compaction_manager& argument from reshape()
sstable_directory: Move the .reshape() to distributed loader
sstable_directory: Add helper to retrieve local sstables
sstable_directory: Add io-prio argument to .reshape()
sstable_directory: Move reshape() to distributed_loader.cc
CQL transport server error handling fixes and improvements:
* log failed requests with `DEBUG` level for easier debugging;
* in case of unhandled errors, deliver them to the client as `SERVER_ERROR`s;
* fix for `protocol_error`s in case of shedded big requests;
* explicit tests have been written for the error handling problems above.
Closes #11949
* github.com:scylladb/scylladb:
transport server: fix "request size too large" handling
transport server: log failed requests with debug level
transport server: fix unexpected server errors handling
transport server: log client errors with debug level
It unlinks unshared sstables filtering some of them out. Name it
according to what it does without mentioning reshape/reshard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently it's called remove_input_sstables_from_resharding(), but it
just unlinks sstables in parallel from the given list. So rename it so it
doesn't mention resharding, and also make use of this "new" helper in
remove_input_sstables_from_reshaping(), which needs exactly the same
functionality.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them, collecting sstables from resharding and reshaping.
Both do the same job, except that the latter doesn't expect the list to
contain remote sstables.
This patch merges them together with the help of an extra sanity boolean
that checks no remote sstables are in the list, and renames the method so
it doesn't mention reshape/reshard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We have a cql3::expr::expression::printer wrapper that annotates
an expression with a debug_mode boolean prior to formatting. The
fmt library, however, provides a much simpler alternative: a custom
format specifier. With this, we can write format("{:user}", expr) for
user-oriented prints, or format("{:debug}", expr) for debug-oriented
prints (if nothing is specified, the default remains debug).
This is done by implementing fmt::formatter::parse() for the
expression type, and using expression::printer internally.
Since sometimes we pass expression element types rather than
the expression variant, we also provide a custom formatter for all
ExpressionElement types.
Uses for expression::printer are updated to use the nicer syntax. In
one place we eliminate a temporary that is no longer needed since
ExpressionElement:s can be formatted directly.
Closes #12702
Calling _read_buf.close() doesn't imply eof(), some data
may have already been read into kernel or client buffers
and will be returned next time read() is called.
When the _server._max_request_size limit was exceeded
and the _read_buf was closed, the process_request method
finished and we started processing the next request in
connection::process. The unread data from _read_buf was
treated as the header of the next request frame, resulting
in "Invalid or unsupported protocol version" error.
The existing test_shed_too_large_request was adjusted.
It was originally written with the assumption that the data
of a large query would simply be dropped from the socket
and the connection could be used to handle the
next requests. This behaviour was changed in scylladb#8800,
now the connection is closed on the Scylla side and
can no longer be used. To check there are no errors
in this case, we use Scylla metrics, getting them
from the Scylla Prometheus API.
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was a timeout error.
A new test_batch_with_error is added. It is quite
difficult to reproduce error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.
Closes: scylladb#12104
Since the whole reshard() is local to the distributed loader code now, the
caller of the reshard helper may let this method get the threshold itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all the resharding logic is accumulated in distributed loader and the
sstable_directory is just the place where sstables are collected.
The changes summary is:
- add sstable_directory as argument (used to be "this")
- replace all "this" captures with &dir ones
- remove temporary namespace gap and declaration from sst-dir class
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to generalize the code duplication between .reshard() and
existing .load_foreign_sstables() (plural form).
Coroutinize it right away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it gets one from this->, but the method is becoming a static one in
distributed_loader, which only has it as an argument. That's no big deal,
as the current IO class is eventually going to be derived from the current
scheduling group, so this extra argument will go away entirely some day.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all the reshaping logic is accumulated in distributed loader and the
sstable_directory is just the place where sstables are collected.
The changes summary is:
- add sstable_directory as argument (used to be "this")
- replace all "this" captures with &dir ones
- remove temporary namespace gap and declaration from sst-dir class
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are methods to retrieve shared local sstables and foreign sstables,
so here's one more for the family.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it gets one from this->, but the method is becoming a static one in
distributed_loader, which only has it as an argument. That's no big deal,
as the current IO class is eventually going to be derived from the current
scheduling group, so this extra argument will go away entirely some day.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.
Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.
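The ordering fix can be sketched with plain std::async (all names below are illustrative, not the real storage_proxy/CDC types): the preimage read must finish before either write begins, and only the two writes may overlap.

```cpp
#include <future>
#include <string>

// Stand-ins for the base-table row and the CDC log entry.
std::string base_table = "old";
std::string cdc_log;

void apply_with_preimage(const std::string& new_value) {
    // Sequential step: capture the pre-update state first, so the preimage
    // can never observe the update it accompanies.
    std::string preimage = base_table;

    // Concurrent step: the CDC log write and the base write may overlap,
    // as in the real fix.
    auto cdc  = std::async(std::launch::async, [&] { cdc_log = "pre=" + preimage; });
    auto base = std::async(std::launch::async, [&] { base_table = new_value; });
    cdc.get();
    base.get();
}
```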
Fixes #12098
Test snapshot transfer by reducing the snapshot threshold on initial
servers (3 and 1 trailing).
Then it creates a table and makes 3 extra schema changes (add column),
triggering at least 2 snapshots.
Then it brings a new server into the cluster, which will get the schema
through a snapshot.
Finally, the test stops the initial servers and verifies the table
schema is up to date on the new server.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Let the initial range passed to query_partition_key_range
be [1, 2) where 2 is the successor of 1 in terms
of ring_position order and 1 is equal to vnode.
Then query_ranges_to_vnodes_generator() -> [[1, 1], (1, 2)],
so we get an empty range (1,2) and subsequently will
make a data request with this empty range in
storage_proxy::query_partition_key_range_concurrent,
which will be redundant.
The patch adds a check for this condition after
making a split in the main loop in process_one_range.
The patch does not attempt to handle cases where the
original ranges were empty, since this check is the
responsibility of the caller. We only take care
not to add empty ranges to the result as an
unintentional artifact of the algorithm in
query_ranges_to_vnodes_generator.
A test case is added in test_get_restricted_ranges.
The helper lambda check is changed so as not to limit
the number of ranges to the length of the expected
ranges; otherwise this check passes even without
the change in process_one_range.
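The emptiness condition can be sketched over integer tokens (a simplification of ring_position order; the struct below is invented for illustration): a doubly-exclusive range (a, b) is empty exactly when b is the immediate successor of a.

```cpp
// Illustrative bounded range and the emptiness check added after the split.
struct range {
    int start, end;
    bool start_inclusive, end_inclusive;
};

bool is_empty(const range& r) {
    if (r.start_inclusive && r.end_inclusive) {
        return r.start > r.end;       // [a, b] empty only if a > b
    }
    if (r.start_inclusive || r.end_inclusive) {
        return r.start >= r.end;      // [a, b) or (a, b] empty if a >= b
    }
    return r.start + 1 >= r.end;      // (a, b) empty if nothing lies between
}
```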
Fixes: #12566
Closes #12755
request_controller_timeout_exception_factory::timeout() creates an
instance of `request_controller_timed_out_error`, whose ctor is
default-generated by the compiler from that of timed_out_error, which is
in turn default-generated from that of `std::exception`, and
`std::exception::exception` does not throw. So it's safe to
mark this factory method `noexcept`.
With this specifier, we don't need to worry about exceptions thrown
by it, and don't need to handle them in `seastar::semaphore`,
where `timeout()` is called for the customized exception.
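The exception chain can be reproduced in a standalone sketch (simplified: the real classes live in Scylla/Seastar); the compiler itself verifies the non-throwing claim via static_assert:

```cpp
#include <exception>

// Each constructor below is compiler-generated, ultimately from
// std::exception's non-throwing default constructor.
struct timed_out_error : public std::exception {
    const char* what() const noexcept override { return "timed out"; }
};

struct request_controller_timed_out_error : public timed_out_error {};

struct request_controller_timeout_exception_factory {
    static request_controller_timed_out_error timeout() noexcept {
        return request_controller_timed_out_error();
    }
};

// The claim in the commit message, checked at compile time:
static_assert(noexcept(request_controller_timeout_exception_factory::timeout()));
```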
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12759
- main: consider EDQUOT as environmental failure also
- main: use defer_verbose_shutdown() to shutdown compaction manager
- replica/table: extract should_retry() into with_retry
- replica/table: retry on EDQUOT when flushing memtable
Fixes #12626
Closes #12653
* github.com:scylladb/scylladb:
replica/table: retry on EDQUOT when flushing memtable
replica/table: extract should_retry() into with_retry
main: use defer_verbose_shutdown() to shutdown compaction manager
main: consider EDQUOT as environmental failure also
Currently, if a UDA uses a UDF that's being replaced,
the UDA will still keep using the old UDF until the
node is restarted.
This patch fixes this behavior by checking all UDAs
when replacing a UDF and updating them if necessary.
Fixes #12709
This patch fixes#12475, where an aggregation (e.g., COUNT(*), MIN(v))
of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()")
resulted in an internal error instead of the "zero" result that each
aggregator expects (e.g., 0 for COUNT, null for MIN).
The problem is that normally our aggregator forwarder picks the nodes
which hold the relevant partition(s), forwards the request to each of
them, and then combines these results. When there are no partitions,
the query is sent to no node, and we end up with an empty result set
instead of the "zero" results. So in this patch we recognize this
case and build those "zero" results (as mentioned above, these aren't
always 0 and depend on the aggregation function!).
The patch also adds two tests reproducing this issue in a fairly general
way (e.g., several aggregators, different aggregation functions) and
confirming the patch fixes the bug.
The test also includes two additional tests for COUNT aggregation, which
uncovered an incompatibility with Cassandra which is still not fixed -
so these tests are marked "xfail":
Refs #12477: Combining COUNT with GROUP by results with empty results
in Cassandra, and one result with empty count in Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12715
Related: https://github.com/scylladb/scylladb/pull/12586
This PR improves the upgrade policy added with https://github.com/scylladb/scylladb/pull/12586, according to the feedback from:
@tzach
> Upgrading from 4.6 to 5.0 is not clear; better to use 4.x to 4.y versions as an example.
and @bhalevy
> It is not completely clear that when upgrading through several versions, the whole cluster needs to be upgraded to each consecutive version, not just the rolling node.
In addition, the content is organized into sections for the sake of readability.
Closes #12647
* github.com:scylladb/scylladb:
doc: add the information about patch releases
doc: add the info about the minor versions
doc: reorganize the content on the Upgrade ScyllaDB page
doc: improve the overview of the upgrade procedure (apply feedback)
This patch adds three additional tests for empty (e.g., empty string)
clustering keys.
The first test disproves a worry that was raised in #12561 that perhaps
empty clustering keys only seem to work, but they don't get written to
sstables. The new test verifies that there is no bug - they are written
and can be read correctly.
The second and third test reproduce issue #12749 - an empty clustering
key should be allowed in a COMPACT STORAGE table only if there is a compound
(multi-column) clustering key. But as the tests demonstrate, 1. if there
is just one clustering column, Scylla gives the wrong error message, and
2. if there is a compound clustering key, Scylla doesn't allow an empty
key as it should.
As usual, all tests pass on Cassandra. The last two tests fail on
Scylla, so are marked xfail.
Refs #12561
Refs #12749
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12750
Ideally, these errors should be transparently delivered
to the client, but in practice, due to various
flaws/bugs in scylla and/or the driver,
they can be lost, which enormously complicates troubleshooting.
const socket_address& get_remote_address() is needed for its
convenient conversion to string, which includes ip and port.
When deciding whether two functions have the same
signature, we have to check if they have the same name
and parameter types. Additionally, if they're represented
by pointers, we need to check if any of them is a nullptr.
This logic is used multiple times, so it's extracted to
a separate function.
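The extracted comparison can be sketched like this (the struct and function names are illustrative, not the real cql3 types): two functions share a signature iff they have the same name and argument types, and a null pointer matches nothing.

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for a user-defined function's identity.
struct user_function {
    std::string name;
    std::vector<std::string> arg_types;
};

using function_ptr = std::shared_ptr<user_function>;

bool same_signature(const function_ptr& a, const function_ptr& b) {
    if (!a || !b) {
        return false;  // a missing function matches nothing
    }
    return a->name == b->name && a->arg_types == b->arg_types;
}
```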
To use this function, the `used_by_user_aggregate` method
now takes a function instead of a name and types list - we
can do it because we always use it with an existing user
function (the one we're trying to drop).
The method will also be useful when we're not dropping,
but replacing a user function.
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.
Retry when a memtable flush fails due to EDQUOT.
There are chances that the user exceeds the disk quota when scylla flushes
a memtable, and manages to free up the necessary resources before the
next retry.
Before this change, we simply `abort()` in this case.
After this change, we just keep retrying until the service is shut down.
Fixes #12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* extract a lambda encapsulating the condition of whether we should retry
upon seeing an exception when calling functions with `with_retry()`.
We apply the same check to the exception raised when performing
table-related i/o operations. In this change, the two checks are
consolidated and extracted into a single lambda, so we can add
yet more error codes which should be considered retry-able
failures.
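The retry loop and the extracted condition can be sketched as follows (simplified: no backoff or shutdown check, and the names are stand-ins for the real replica/table code):

```cpp
#include <cerrno>
#include <system_error>

// Environmental failures the operator can fix by freeing quota or space.
static bool should_retry(const std::system_error& e) {
    auto ec = e.code().value();
    return ec == EDQUOT || ec == ENOSPC;
}

// Keep retrying func() as long as it fails with a retry-able errno;
// anything else propagates to the caller.
template <typename Func>
auto with_retry(Func func) {
    for (;;) {
        try {
            return func();
        } catch (const std::system_error& e) {
            if (!should_retry(e)) {
                throw;  // not a transient environmental failure
            }
            // else: fall through and retry the operation
        }
    }
}
```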
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* use `defer_verbose_shutdown()` to shutdown compaction manager
`EDQUOT` is quite similar to `ENOSPC`, in the sense that both of them
are caused by environmental issues.
Before this change, `compaction_manager` filtered the
ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`,
so they were not propagated to the caller of
`compaction_manager::stop()` -- only a warning message was printed
in the log. But `EDQUOT` was not handled.
After this change, the exception raised by the compaction manager's
stop process is not filtered anymore and is handled by
`defer_verbose_shutdown()` instead, which is able to check the
type of the exception and print an error message in the log. So
the `ENOSPC` and `EDQUOT` errors are taken care of, and are more
visible from the user's perspective, as they are printed as errors
instead of warnings. But they are not printed using the
`compaction_manager` logger anymore, so if our testing or a user's
workflow depends on this behavior, the related setting should be
updated accordingly.
Fixes #12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
EDQUOT can be returned as the errno when the underlying filesystem
is trying to reserve necessary resources from disk for performing
i/o on behalf of the effective user, and the filesystem fails to
acquire them. It could be an inode, volume space,
or whatever resource is needed for completing the i/o operation. None
of these is a consequence of scylla's fault, so we should not
`abort()` upon seeing this errno. Instead, it should be reported
to the administrator.
In this change, EDQUOT is also considered an environmental
failure, just like EIO, EACCES and ENOSPC. They could be thrown
when stopping a server.
Fixes #12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We currently have two method families to generate partition keys:
* make_keys() in test/lib/simple_schema.hh
* token_generation_for_shard() in test/lib/sstable_utils.hh
Both work only for schemas with a single partition key column of `text` type and both generate keys of fixed size.
This is very restrictive and simplistic. Tests that wanted anything more complicated than that had to rely on open-coded key generation.
Also, many tests started to rely on the simplistic nature of these keys; in particular, two tests started failing because the new key generation method generated keys of varying size:
* sstable_compaction_test.sstable_run_based_compaction_test
* sstable_mutation_test.test_key_count_estimation
These two tests seem to depend on all generated keys being of the same size. This makes some sense in the case of the key count estimation test, but makes no sense at all to me in the case of the sstable run test.
Closes #12657
* github.com:scylladb/scylladb:
test/lib/sstable_utils: remove now unused token_generation_for_shard() and friends
test/lib/simple_schema: remove now unused make_keys() and friends
test: migrate to tests::generate_partition_key[s]()
test/lib/test_services: add table_for_tests::make_default_schema()
test/lib: add key_utils.hh
test/lib/random_schema.hh: value_generator: add min_size_in_bytes
This patch fixes 2 issues with checking whether UDFs are used in UDAs:
1) UDF argument types are not considered during the check, which prevents us from dropping a UDF that isn't used in any UDAs, but shares its name with one that is.
2) the REDUCEFUNC is not considered during the check, which allows dropping a UDF even though it's used in a UDA as the REDUCEFUNC.
Additionally, tests for these issues are added.
Closes #12681
* github.com:scylladb/scylladb:
udf: also check reducefunc to confirm that a UDF is not used in a UDA
udf: fix dropping UDFs that share names with other UDFs used in UDAs
pytest: add optional argument for new_function argument types
Fixes: https://github.com/scylladb/scylladb/issues/12309
Closes #12720
* github.com:scylladb/scylladb:
service/raft: raft_group_registry: use recent_entries_map to store rate_limits in pinger. Fixes #12309
utils: introduce recent_entries_map datatype to track least recent visited entries.
This PR is related to https://github.com/scylladb/scylla-enterprise/issues/2176.
It adds a FAQ about a workaround to install a ScyllaDB version that is not the most recent patch version.
In addition, the link to that FAQ is added to the patch upgrade guides 2021 and 2022 .
Closes #12660
* github.com:scylladb/scylladb:
doc: add the missing sudo command
doc: replace the redundant link with an alternative way to install a non-latest version
doc: add the link to the FAQ about pinning to the patch upgrade guides 2021 and 2022
doc: add a FAQ with a workaround to install a non-latest ScyllaDB version on Debian and Ubuntu
`poetry install` consistently times out when resolving the
dependencies, like:
```
Command ['/home/kefu/.cache/pypoetry/virtualenvs/scylla-1fWQLpOv-py3.9/bin/python', '-m', 'pip', 'install', '--use-pep517', '--disable-pip-version-check', '--isolated', '--no-input', '--prefix', '/home/kefu/.cache/pypoetry/virtualenvs
/scylla-1fWQLpOv-py3.9', '--upgrade', '--no-deps', '/home/kefu/.cache/pypoetry/artifacts/e6/ad/ab/eca9f61c5b15fd05df7192c0e5914a9e5ac927744b1fb5f6c07a92d7a4/sphinx-sitemap-2.2.0.tar.gz'] errored with the following return code 1, and out
put:
Processing /home/kefu/.cache/pypoetry/artifacts/e6/ad/ab/eca9f61c5b15fd05df7192c0e5914a9e5ac927744b1fb5f6c07a92d7a4/sphinx-sitemap-2.2.0.tar.gz
Installing build dependencies: started
Installing build dependencies: finished with status 'error'
ERROR: Command errored out with exit status 2:
command: /home/kefu/.cache/pypoetry/virtualenvs/scylla-1fWQLpOv-py3.9/bin/python /tmp/pip-standalone-pip-z97s216j/__env_pip__.zip/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-37k3lwqd/overlay --no-warn-scrip
t-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel
cwd: None
Complete output (80 lines):
Collecting setuptools>=40.8.0
Downloading setuptools-67.1.0-py3-none-any.whl (1.1 MB)
ERROR: Exception:
Traceback (most recent call last):
File "/tmp/pip-standalone-pip-z97s216j/__env_pip__.zip/pip/_vendor/urllib3/response.py", line 438, in _error_catcher
yield
File "/tmp/pip-standalone-pip-z97s216j/__env_pip__.zip/pip/_vendor/urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/tmp/pip-standalone-pip-z97s216j/__env_pip__.zip/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read
data = self.__fp.read(amt)
File "/usr/lib64/python3.9/http/client.py", line 463, in read
n = self.readinto(b)
File "/usr/lib64/python3.9/http/client.py", line 507, in readinto
n = self.fp.readinto(b)
File "/usr/lib64/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/lib64/python3.9/ssl.py", line 1242, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib64/python3.9/ssl.py", line 1100, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
```
while sphinx-sitemap 2.5.0 installs without problems. sphinx-sitemap
2.5.0 is the latest version published to pypi.
According to sphinx-sitemap's changelog at
https://github.com/jdillard/sphinx-sitemap/blob/master/CHANGELOG.rst ,
no breaking changes were introduced between 2.2.0 and 2.5.0.
After bumping to sphinx-sitemap 2.5.0, the following commands complete
without errors:
```
poetry lock
make preview
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12705
[table: Fix disk-space related metrics](529a1239a9) fixes the table's disk-space related metrics,
whereas the second patch fixes an inefficiency when computing statistics, which can be triggered with multiple compaction groups.
Closes #12718
* github.com:scylladb/scylladb:
table: Fix inefficiency when rebuilding statistics with compaction groups
table: Fix disk-space related metrics
When dropping a UDF we're checking that it's not being used in any UDAs
and fail otherwise. However, we're only checking its state function
and final function, while it may also be used as a reduce function.
This patch adds the missing checks and a test for them.
Currently, when dropping a function, we only check if there exist
an aggregate that uses a function with the same name as its state
function or final function. This may cause the drop to fail even
when it's just another UDF with the same name that's used in the
aggregate, even when the actual dropped function is not used there.
This patch fixes this by checking not only the names of the
UDA's sfunc and finalfunc, but also their argument types.
When multiple functions with the same name but different argument types
are created, the default drop statement for these functions will fail
because it does not include the argument types.
With this change, this problem can be worked around by specifying
argument types when creating the function, as this will cause the drop
statement to include them.
The system_keyspace defines several auxiliary methods to help view_builder update the system.scylla_views_builds_in_progress and system.built_views tables. All use the global qctx.
It only takes adding a view_builder -> system_keyspace dependency to de-static all those helpers and let them use the query processor from it, not the qctx.
Closes #12728
* github.com:scylladb/scylladb:
system_keyspace: De-static calls that update view-building tables
storage_service: Coroutinize mark_existing_views_as_built()
api: Unset column_family endpoints
api: Carry sharded<db::system_keyspace> reference over
view_builder: Add system_keyspace dependency
In most cases, tasks manager's tasks are started just after they are
created. Thus, to reduce boilerplate required for creating and starting
tasks, tasks::task_manager::module::make_and_start_task method is added.
Repair tasks are modified to use the method where possible.
Closes #12729
* github.com:scylladb/scylladb:
repair: use tasks::task_manager::module::make_and_start_task for repair tasks
tasks: add task_manager::module::make_and_start_task method
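The boilerplate reduction can be sketched like this (the tiny `task`/`module` types below are illustrative, not the real tasks::task_manager API): instead of every call site doing make_task(...) followed by start(), the module offers one method doing both.

```cpp
#include <memory>
#include <vector>

struct task {
    bool started = false;
    void start() { started = true; }
};

struct module {
    std::vector<std::shared_ptr<task>> tasks;

    std::shared_ptr<task> make_task() {
        auto t = std::make_shared<task>();
        tasks.push_back(t);  // register with the module
        return t;
    }

    // Create and immediately start, as most call sites want.
    std::shared_ptr<task> make_and_start_task() {
        auto t = make_task();
        t->start();
        return t;
    }
};
```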
Since 97bb2e47ff (storage_service: Enable Repair Based Node Operations (RBNO) by default for replace), RBNO was enabled by default for replace ops.
After more testing, we decided to enable repair based node operations by default for all node operations.
Closes #12173
* github.com:scylladb/scylladb:
storage_service: Enable Repair Based Node Operations (RBNO) by default for all node ops
test: Increase START_TIMEOUT
test: Increase max-networking-io-control-blocks
storage_service: Check node has left in node_ops_cmd::decommission_done
repair: Use remote dc neighbors for everywhere strategy
Introduces task manager's compaction module. That's an initial
part of integration of compaction with task manager.
When fully integrated, task manager will allow user to track compaction
operations, check status and progress of each individual one. It will help
with creating an asynchronous version of rest api that forces any compaction.
Currently, users can see with /task_manager/list_modules api call that
compaction is one of the modules accessible through task manager.
They won't get any additional information though, since compaction
tasks are not created yet.
A shared_ptr to compaction module is kept in compaction manager.
Closes #12635
* github.com:scylladb/scylladb:
compaction: test: pass task_manager to compaction_manager in test environment
compaction: create and register task manager's module for compaction
tasks: add task_manager constructor without arguments
Makes sstable_set::all() interface robust, and introduces sstable_set::size() to avoid copies when retrieving set size.
Closes #12716
* github.com:scylladb/scylladb:
treewide: Use new sstable_set::size() wherever possible
sstables: Introduce sstable_set::size()
sstables: Fix fragility of sstable_set::all() interface
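The motivation for size() can be sketched with a simplified set type (invented for illustration; the real sstable_set is far richer): all() materializes a copy of the contents, so call sites that only needed the count now use a cheap size() instead.

```cpp
#include <memory>
#include <string>
#include <vector>

struct sstable {
    std::string name;
};

class sstable_set {
    std::vector<std::shared_ptr<sstable>> _sstables;
public:
    void insert(std::shared_ptr<sstable> sst) {
        _sstables.push_back(std::move(sst));
    }

    // Returns a copy: fine for iteration, wasteful if you only need a count.
    std::vector<std::shared_ptr<sstable>> all() const { return _sstables; }

    // New: the count without copying the underlying container.
    size_t size() const { return _sstables.size(); }
};
```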
This series switches memtable and cache to use a new representation for mutation data,
called `mutation_partition_v2`. In this representation, range tombstone information is stored
in the same tree as rows, attached to row entries. Each entry has a new tombstone field,
which represents range tombstone part which applies to the interval between this entry and
the previous one. See docs/dev/mvcc.md for more details about the format.
The transient mutation object still uses the old model in order to avoid work needed to adapt
old code to the new model. It may also be a good idea to live with two models, since the
transient mutation has different requirements and thus different trade-offs can be made.
Transient mutation doesn't need to support eviction and strong exception guarantees,
so its algorithms and in-memory representation can be simpler.
This allows us to incrementally evict range tombstone information. Before this series,
range tombstones were accumulated and evicted only when the whole partition entry was evicted. This
could lead to inefficient use of cache memory.
Another advantage of the new representation is that reads don't have to lookup
range tombstone information in a different tree while reading. This leads to simpler
and more efficient readers.
There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes.
Secondly, update algorithms are more complex because they need to deoverlap range tombstone
information. Also, to handle preemption and provide strong exception guarantees, update
algorithms may need to allocate sentinel entries, which adds complexity and reduces performance.
The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.
Remaining work:
- performance optimizations to apply_monotonically() to avoid regressions
- performance testing
- preemption support in apply_to_incomplete (cache update from memtable)
Fixes #2578
Fixes #3288
Fixes #10587
Closes #12048
* github.com:scylladb/scylladb:
test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction
test: mvcc: Extract mvcc_container::allocate_in_region()
row_cache, lru: Introduce evict_shallow()
test: mvcc: Avoid copies of mutation under failure injection
test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic
mutation_partition_v2: Avoid full scan when applying mutation to non-evictable
Pass is_evictable to apply()
tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1
tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint
tests: mutation_partition_v2: Add test for exception safety of mutation merging
tests: Add tests for the mutation_partition_v2 model
mutation_partition_v2: Implement compact()
cache_tracker: Extract insert(mutation_partition_v2&)
mvcc, mutation_partition: Document guarantees in case merging succeeds
mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically()
mutation_partition_v2: Simplify get_continuity()
row_cache: Distinguish dummy insertion site in trace log
db: Use mutation_partition_v2 in mvcc
range_tombstone_change_merger: Introduce peek()
readers: Extract range_tombstone_change_merger
mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots
mvcc: partition_snapshot_row_cursor: Support digest calculation
mutation_partition_v2: Store range tombstones together with rows
db: Introduce mutation_partition_v2
doc: Introduce docs/dev/mvcc.md
db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU
db: Print range_tombstone bounds as position_in_partition
test: memtable_test: Relax test_segment_migration_during_flush
test: cache_flat_mutation_reader: Avoid timestamp clash
test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows
test: mvcc: Fix sporadic failures due to compact_for_compaction()
test: lib: random_mutation_generator: Produce partition tombstone less often
test: lib: random_utils: Introduce with_probability()
test: lib: Improve error message in has_same_continuity()
test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker
test: mvcc: Insert entries in the tracker
test: mvcc_test: Do not set dummy::no on non-clustering rows
mutation_partition: Print full position in error report in append_clustered_row()
db: mutation_cleaner: Extract make_region_space_guard()
position_in_partition: Optimize equality check
mvcc: Fix version merging state resetting
mutation_partition: apply_resume: Mark operator bool() as explicit
get_address_ranges() and get_ranges() perform almost the same computation.
They return the same ranges -- the only difference is that
get_address_ranges() returns them in unspecified order, while get_ranges()
returns them in sorted order. Therefore the result of get_ranges() is also
a valid result for get_address_ranges(), and the two functions can be unified
to avoid code duplication. This patch does just that.
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.
This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
Refs #10337
Refs #10817
Refs #10836
Refs #10837 Fixes #12724
In most cases, task manager's tasks are started just after they are
created. Thus, to reduce the boilerplate required for creating and starting
tasks, a make_and_start_task method is added.
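The shape of the change can be sketched as follows. This is an illustrative Python model, not the ScyllaDB task manager API; the class and method names are assumptions made for the example.

```python
# Illustrative sketch: a module that previously required two calls --
# make_task() then task.start() -- gains a make_and_start_task() helper
# that combines them, removing call-site boilerplate.
class Task:
    def __init__(self, name):
        self.name = name
        self.started = False

    def start(self):
        self.started = True


class Module:
    def __init__(self):
        self.tasks = []

    def make_task(self, name):
        task = Task(name)
        self.tasks.append(task)
        return task

    def make_and_start_task(self, name):
        # Creation and starting in one step.
        task = self.make_task(name)
        task.start()
        return task


module = Module()
t = module.make_and_start_task("compaction")
assert t.started and module.tasks == [t]
```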
There was a missing check in validation of named
bind markers.
Let's say that a user prepares a query like:
```cql
INSERT INTO ks.tab (pk, ck, v) VALUES (:pk, :ck, :v)
```
Then they execute the query, but specify only
values for `:pk` and `:ck`.
We should detect that a value for `:v` is missing
and throw an invalid_request_exception.
Until now there was no such check, in case of a missing variable
invalid `query_options` were created and Scylla crashed.
Sadly it's impossible to create a regression test
using `cql-pytest` or `boost`.
`cql-pytest` uses the python driver, which silently
ignores missing named bind variables, deciding
that the user meant to send an UNSET_VALUE for them.
When given values like `{'pk': 1, 'ck': 2}`, it will automatically
extend them to `{'pk': 1, 'ck': 2, 'v': UNSET_VALUE}`.
In `boost` I tried to use `cql_test_env`,
but it only has methods which take valid `query_options`
as a parameter. I could create separate unit tests
for the creation and validation of `query_options`,
but they wouldn't be true end-to-end tests like `cql-pytest`.
The bug was found using the rust driver,
the reproducer is available in the issue description.
Fixes: #12727
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #12730
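The missing check can be sketched like this. A hedged illustration only: the function and exception names are made up for the example, not the CQL layer's actual identifiers.

```python
# Sketch of the added validation: given the named bind markers a prepared
# statement declares and the values the client supplied, reject execution
# when any marker has no value, instead of building invalid query_options.
class InvalidRequest(Exception):
    pass


def validate_named_values(declared_markers, supplied):
    missing = [m for m in declared_markers if m not in supplied]
    if missing:
        raise InvalidRequest(f"missing values for bind markers: {missing}")


# All markers supplied: passes.
validate_named_values(["pk", "ck", "v"], {"pk": 1, "ck": 2, "v": 3})

# Value for :v missing: rejected instead of crashing later.
try:
    validate_named_values(["pk", "ck", "v"], {"pk": 1, "ck": 2})
    assert False, "expected InvalidRequest"
except InvalidRequest:
    pass
```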
To trigger snapshot limit behavior provide an error injection to set
with one-shot.
Note this effectively changes it and there is no revert.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
There's a bunch of them used mainly by view_builder and also by the API
and storage_service. All use the global qctx to do their job; now that the
callers have main-local sharded<system_keyspace> references, they can be
made non-static.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a start-only method.
Making it a coroutine helps further patching.
Also restrict the call to be shard-0 only; it's such anyway, but this lets
the code have fewer nested coroutinized lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API calls in question will use system keyspace, that starts before
(and thus stops after) and nowadays indirectly uses database instance
that also starts earlier (and also stops later), so this avoids
potential dangling references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's the column_family/get_built_indexes call that calls a system
keyspace method to fetch data from scylla_views_builds_in_progress
table, so the system keyspace reference will be needed in the API
handler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view builder updates system.scylla_views_builds_in_progress and
.built_views tables and thus needs the system keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Each instance of compaction manager should have its compaction module pointer
initialized. All constructors get a task_manager reference with which
the module is created.
Preferred alternative to sstable_set->all()->size(), which may
involve copying elements from a single set, or from multiple ones
if compound_sstable_set is used.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since 97bb2e47ff (storage_service: Enable
Repair Based Node Operations (RBNO) by default for replace), RBNO was
enabled by default for replace ops.
After more testing, we decided to enable repair based node operations by
default for all node operations.
As an initial part of integration of compaction with task manager, compaction
module is added. Compaction module inherits from tasks::task_manager::module
and shared_ptr to it is kept in compaction manager. No compaction tasks are
created yet.
System keyspace is a keyspace with local replication strategy, and thus
it does not need to be repaired. It is possible to invoke repair
of this keyspace through the api, which leads to a runtime error since
peer_events and scylla_table_schema_history have different sharding logic.
For keyspaces with local replication strategy, repair_service::do_repair_start
now returns immediately.
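The early return can be sketched as below. This is an illustrative model with assumed names, not the repair_service implementation.

```python
# Sketch of the guard: repair of a keyspace using LocalStrategy is a
# no-op, since its data is node-local and never needs cross-node repair.
def do_repair_start(keyspace):
    if keyspace["replication_strategy"] == "LocalStrategy":
        return "skipped"   # nothing to repair: data is node-local
    return "started"


assert do_repair_start({"replication_strategy": "LocalStrategy"}) == "skipped"
assert do_repair_start({"replication_strategy": "NetworkTopologyStrategy"}) == "started"
```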
This PR extends the description of using `nodetool removenode` to remove an unavailable node, as requested in https://github.com/scylladb/scylla-enterprise/issues/2338.
Closes #12410
* github.com:scylladb/scylladb:
docs: improve the warning and add a comment to update/remove the information in the future
doc: extend the information on removing an unavailable node
docs: extend the warning on the Remove a Node page
`TopologyTest`s (used by `topology/` suite and friends) already relied
on the `is_dirty` flag stored in `ScyllaCluster` thanks to
`ScyllaClusterManager` (which passes the flag when returning a cluster
to the pool).
But `PythonTest`s (cql-pytest/ suite) and `CQLApprovalTest`s (cql/
suite) had different ways to decide whether a cluster should be
recycled. For example, `PythonTest` would recycle a cluster if
`after_test` raised an exception. This depended on a post-condition
check made by `after_test`: it would query the number of keyspaces and
throw an exception if it was different than when the test started. If
the cluster (which for `PythonTest` is always single-node) was dead,
this query would fail.
However, we modified the behavior of `after_test` in earlier commits -
it no longer preforms the post-condition check on dirty clusters. So
it's also no longer reliable to use the exception raised by `after_test`
to decide that we should recycle the cluster.
Unify the behavior of `PythonTest` and `CQLApprovalTest` with what
`TopologyTest` does - using the `is_dirty` flag to decide that we should
recycle a cluster. Thanks to earlier commits, this flag is set to `True`
whenever a test fails, so it should cover most cases where we want to
recycle a cluster. (The only case not currently covered is if a
non-dirty cluster crashes after we perform the keyspace post-condition
check, which seems quite improbable.)
Note that this causes us to recycle clusters more often in these tests:
previously, when a `PythonTest` or `CQLApprovalTest` failed, but the
cluster was still alive and the post-condition check passed, we would
use the cluster for the next test. Now we recycle a cluster whenever a
test that used it fails.
After a failed test, the cluster might be down so dropping the
random_tables keyspace might be impossible. The cluster will be marked
dirty so it doesn't matter that we leave any garbage there.
Note: we already drop only if the cluster is not marked as dirty, and we
mark the cluster as dirty after a failed test. However, marking the
cluster as dirty after a failed test happens at the end of the `manager`
fixture and the `random_tables` fixture depends on the `manager`
fixture, so at the end of the `random_tables` fixture the cluster still
wasn't marked as dirty. Hence the fixture must access the
pytest-provided `request` fixture where we store a flag whether the test
has failed.
New test/lib/scylla_test_case.hh, introduced in "tests: Add command line options for Scylla unit tests",
allows extension of the command line options provided by Seastar testing framework.
It allows all boost tests to process additional options without changing a single line of code.
Patch "test: Add x-log2-compaction-groups to Scylla test command line options" builds on that, allowing
all test cases to run with N compaction groups. Again, without changing a line of code in the tests.
Now all you have to do is:
./build/dev/test/boost/sstable_compaction_test -- --smp 1 --x-log2-compaction-groups 1
./test.py --mode=dev --x-log2-compaction-groups 1 --verbose
And it will run the test cases with as many groups as you wish.
./test.py passes successfully with parameter --x-log2-compaction-groups 1.
Closes #12369
* github.com:scylladb/scylladb:
test.py: Add option to run scylla tests with multiple compaction groups
test: Add x-log2-compaction-groups to Scylla test command line options
test: Enable Scylla test command line options for boost tests
tests: Add command line options for Scylla unit tests
replica: table: Add debug log for number of compaction groups
test: sstable_compaction_test: Fix indentation
test: sstable_compaction_test: Make it work with compaction groups
test: test_bloom_filter: Fix it with multiple compaction groups
test: memtable_test: Fix it with multiple compaction groups
`job` was introduced back in 782ebcece4,
so we could consume the option specified in the DEB_BUILD_OPTIONS
environment variable. But now that we always repackage
the artifacts prebuilt in the relocatable package, we don't build
them anymore when packaging debian packages (see
9388f3d626), and `job` is not
passed to `ninja` anymore.
So, in this change, `job` is removed from debian/rules as well, as
it is not used.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Whenever any compaction group has its SSTable set updated, table's
rebuild_statistics() is called and it inefficiently iterates through
SSTable set of all compaction groups.
Now each compaction group keeps track of its statistics, such that
table's rebuild_statistics() only need to sum them up.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The total disk space used metric incorrectly reports the amount of
disk space ever used. It should instead report the size of
all sstables in use plus the ones waiting to be deleted.
Live disk space used, by this definition, shouldn't account for the
ones waiting to be deleted.
And the live sstable count shouldn't account for sstables waiting to
be deleted.
Fix all that.
Fixes#12717.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
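The corrected accounting can be summarized in a small sketch. Names and structure are illustrative, not the actual table statistics code.

```python
# Sketch of the metric definitions after the fix:
# - total disk space used covers live sstables plus those awaiting deletion
# - live disk space used and live sstable count exclude sstables awaiting
#   deletion
def disk_stats(live_sizes, pending_deletion_sizes):
    return {
        "total_disk_space_used": sum(live_sizes) + sum(pending_deletion_sizes),
        "live_disk_space_used": sum(live_sizes),
        "live_sstable_count": len(live_sizes),
    }


stats = disk_stats(live_sizes=[100, 200], pending_deletion_sizes=[50])
assert stats["total_disk_space_used"] == 350   # includes pending deletion
assert stats["live_disk_space_used"] == 300    # excludes pending deletion
assert stats["live_sstable_count"] == 2
```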
all() was returning lw_shared_ptr<sstable_list>, which allowed the caller
to modify the sstable set content, which would mess up everything.
sstable_set is supposed to be modified only through its insert and erase
functions.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.
Fix tests using the previous helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
We don't expect the cluster to be functioning at all after a failed
test. The whole cluster might have crashed, for example. In these
situations the framework would report multiple errors (one for the
actual failure, another for a failed post-condition check because the
cluster was down) which would only obscure the report and make debugging
harder. It's also not safe in general to reuse the cluster in another
test - if the previous test failed, we should not assume that it's in a
valid state.
Therefore, mark the cluster as dirty after a failed test. This will let
us recycle the cluster based on the dirty flag and it will disable
post-condition check after a failed test (which is only done on
non-dirty clusters).
To implement this in topology tests, we use the
`pytest_runtest_makereport` hook which executes after a test finishes
but before fixtures finish. There we store a test-failed flag in a stash
provided by pytest, then access the flag in the `manager` fixture.
`after_test` would count keyspaces and check that the number is the same
as before the test started. The `random_tables` fixture after a test
would drop the keyspace that it created before the test.
These steps are done to ensure that the cluster is ready to be reused
for the next steps. If the cluster is dirty, it cannot be reused anyway,
so the steps are unnecessary. They might also be impossible in general
- a dirty cluster might be completely dead. For example, the attempts to
drop a keyspace from `random_tables` would cause confusing errors
if a test failed when it tried to restart a node while all nodes
were down, making it harder to find the 'real' failure.
Therefore don't perform these operations if the cluster is dirty.
- print the cluster used by the test in `after_test`
- if cluster setup fails in `before_test`, print the cluster together
with the exception (`after_test` is not executed if `before_test`
fails)
`evaluation_inputs` is a struct which contains data needed to evaluate expressions - values of columns, bind variables and other data.
`is_one_of()` is a function used to evaluate `IN` restrictions. It checks whether the LHS is one of the elements of the RHS list.
Generally when evaluating expressions we get the `evaluation_inputs` as an argument and we should pass them along to any functions that evaluate subexpressions.
`is_one_of()` got the inputs as an argument, but didn't pass them along to `equal()`; instead it created a new empty `evaluation_inputs{}` and gave that to `equal()`.
At first [I thought this was a bug](https://github.com/scylladb/scylladb/pull/12356#discussion_r1084300969) - with missing information there could be a crash if `equal()` tried to evaluate an expression with a `bind_variable`.
It turns out that in this particular case `equal()` won't use the `evaluation_inputs` at all. The LHS and RHS passed to it are just constant values, which were already evaluated to serialized bytes before calling `evaluate()`, so there is no bug.
It's still better to pass the inputs argument along if possible. If in the future `equal()` required these inputs for some reason, missing inputs could lead to an unexpected crash.
I couldn't find any tests that would detect this case, so such a bug could stay undetected until an unhappy user finds it because their cluster crashed.
I added some tests to make sure that it's covered from now on.
Closes #12701
* github.com:scylladb/scylladb:
cql-pytest: test filtering using list with bind variable
test/expr_test: test <int_value> IN (123, ?, 456)
cql3: expr: don't pass empty evaluation_inputs in is_one_of
The first patch in this series enables a previously-skipped test for what happens with Limit=0. The test passes.
The second patch adds an xfailing test for very large Limit.
Closes #12625
* github.com:scylladb/scylladb:
test/alternator: xfailing test for huge Limit in ListStreams
alternator/test: un-skip test of zero Limit in ListStreams
In test with ring delay zero, it is possible that when the
node_ops_cmd::decommission_done is received, the nodes remained in the
cluster haven't learned the LEFT status for the leaving node yet.
To guarantee that when the decommission restful api returns, all the nodes
that participated in the decommission operation have learned the LEFT status,
a check in the node_ops_cmd::decommission_done is added in this patch.
After this patch, the decommission tests which start multiple
decommission in a loop with ring delay zero in
test/topology/test_topology.py passes.
Consider:
- Bootstrap n1 in dc 1
- Create ks with EverywhereStrategy
- Bootstrap n2 in dc 2
Since n2 is the first node in dc2, there will be no local-dc nodes to
sync data from. In this case, n2 should sync data with a node in dc 1 even
though it is in a remote dc.
The tests can now optionally run with multiple groups via option
--x-log2-compaction-groups.
This includes boost tests and the ones which run against either
one (e.g. cql) or many instances (e.g. topology).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now any boost test can run with multiple compaction groups by default,
without any change in the boost test cases whatsoever.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have enabled the command line options without changing a
single line of code; we only had to replace the old include
with scylla_test_case.hh.
Next step is to add x-log-compaction-groups options, which will
determine the number of compaction groups to be used by all
instantiations of replica::table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scylla unit tests are limited to command line options defined by
Seastar testing framework.
For extending the set of options, Scylla unit tests can now
include test/lib/scylla_test_case.hh instead of seastar/testing/test_case.hh,
which will "hijack" the entry point and will process the command line
options, then feed the remaining options into seastar testing entry
point.
This is how it looks when asking for help:
Scylla tests additional options:
--help Produces help message
--x-log2-compaction_groups arg (=0) Controls static number of compaction
groups per table per shard. For X groups,
set the option to log (base 2) of X.
Example: Value of 3 implies 8 groups.
Running 1 test case...
App options:
-h [ --help ] show help message
--help-seastar show help message about seastar options
--help-loggers print a list of logger names and exit
--random-seed arg Random number generator seed
--fail-on-abandoned-failed-futures arg (=1)
Fail the test if there are any
abandoned failed futures
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
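The log-base-2 mapping shown in the help text above can be sketched in one line. A trivial illustration; the function name is made up.

```python
# Per the help text: the option value is log (base 2) of the number of
# compaction groups, so a value of 3 implies 2**3 == 8 groups.
def compaction_group_count(x_log2_compaction_groups):
    return 1 << x_log2_compaction_groups


assert compaction_group_count(0) == 1   # default: a single group
assert compaction_group_count(1) == 2
assert compaction_group_count(3) == 8   # matches the help-text example
```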
Tests using replica::table::add_sstable_and_update_cache() cannot
rely on the sstable being added to a single compaction group, if
the test was forced to run with multiple groups.
Additionally let's remove try_flush_memtable_to_sstable(), which
is restricted to a single group, allowing the entire test to now
pass with multiple groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With many compaction groups, the data:filter size ratio becomes small
with a small number of keys.
Test is adjusted to run another check with more keys if efficiency
is higher than expected, but not lower.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With compaction groups, automatic flushing may not pick the user
table. Fix it by using explicit flush.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When the memory consumption of the semaphore reaches the configured
serialize threshold, all permits but the blessed one are blocked from
consuming any more memory. This ensures that past this limit, only one
permit at a time can consume memory.
Such a blessed permit can be registered inactive. Before this patch, it
would still retain its blessed status when doing so. This could result
in this permit being re-queued for admission if it was evicted in the
meanwhile, potentially resulting in a complete deadlock of the semaphore:
* admission queue permits cannot be admitted because there is no memory
* admitted permits are all queued on memory, as none of them is blessed
This patch strips the blessed status from the permit when it is
registered as inactive. It also adds a unit test to verify this happens.
Fixes: #12603 Closes #12694
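The essence of the fix can be modeled in a few lines. This is an illustrative Python model, not the real reader_concurrency_semaphore; class and method names are assumptions.

```python
# Sketch: a permit registered as inactive gives up its "blessed" status,
# so a later eviction cannot re-queue a permit that still holds the
# blessing (which could deadlock the semaphore, as described above).
class Semaphore:
    def __init__(self):
        self.blessed_permit = None


class Permit:
    def __init__(self, sem):
        self.sem = sem
        self.inactive = False

    def bless(self):
        self.sem.blessed_permit = self

    def register_inactive(self):
        # The fix: strip the blessed status before becoming inactive.
        if self.sem.blessed_permit is self:
            self.sem.blessed_permit = None
        self.inactive = True


sem = Semaphore()
p = Permit(sem)
p.bless()
p.register_inactive()
assert sem.blessed_permit is None  # another permit can now be blessed
```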
EOF is only guaranteed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.
Fixes: #11190 Closes #12696
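The probing idea can be sketched with any stream-like API. Here Python's `io.BytesIO` stands in for the C++ input stream; the helper name is made up for the example.

```python
import io

# Sketch: the EOF condition is only reliable after a read attempt past
# the end, so probe with one more read and require it to return 0 bytes.
def at_end_of_stream(stream):
    probe = stream.read(1)  # force the EOF condition to be evaluated
    return len(probe) == 0


s = io.BytesIO(b"abc")
s.read(3)                  # consumed everything, but EOF not yet observed
assert at_end_of_stream(s)

s2 = io.BytesIO(b"abcd")
s2.read(3)
assert not at_end_of_stream(s2)  # one byte remained: validation should fail
```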
Fixes #12601 (maybe?)
Sort the set of tables on ID. This should ensure we never generate duplicates in a paged listing here. Can obviously miss things if they are added between paged calls and end up with a "smaller" UUID/ARN, but that is to be expected.
Closes #12614
* github.com:scylladb/scylladb:
alternator::streams: Special case single table in list_streams
alternator::streams: Only sort tables iff limit < # tables or ExclusiveStartStreamArn set
alternator::streams: Set default list_streams limit to 100 as per spec
alterator::streams: Sort tables in list_streams to ensure no duplicates
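The duplicate-free paging scheme described above can be sketched as follows. An illustrative model only; names mirror the DynamoDB-style `Limit`/`ExclusiveStartStreamArn` parameters but this is not the Alternator implementation.

```python
# Sketch: sort tables by ID, then resume each page strictly after the
# last ID returned. A stable total order guarantees no duplicates across
# pages (though items added between calls with a "smaller" ID can still
# be missed, as noted above).
def list_page(table_ids, limit, exclusive_start=None):
    ordered = sorted(table_ids)
    if exclusive_start is not None:
        ordered = [t for t in ordered if t > exclusive_start]
    page = ordered[:limit]
    # Return a continuation cursor only if the page was full.
    last = page[-1] if len(page) == limit else None
    return page, last


ids = {"c", "a", "b", "d"}
page1, cursor = list_page(ids, limit=2)
page2, _ = list_page(ids, limit=2, exclusive_start=cursor)
assert page1 + page2 == ["a", "b", "c", "d"]  # no duplicates, no gaps
```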
Currently, nothing prevents us from dropping a user type
used in a user function, even though doing so may make us
unable to use the function correctly.
This patch prevents this behavior by checking all function
argument and return types when executing a drop type statement
and preventing it from completing if the type is referenced
by any of them.
Closes #12680
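The guard described above can be sketched like this. The function signature and data layout are assumptions made for the illustration, not the CQL statement code.

```python
# Sketch: before dropping a user type, scan every user function's
# argument and return types, and refuse the drop if the type is
# referenced by any of them.
def check_type_unused(type_name, functions):
    for fn in functions:
        if type_name in fn["arg_types"] or type_name == fn["return_type"]:
            raise Exception(
                f"cannot drop type {type_name}: used by function {fn['name']}")


funcs = [{"name": "f", "arg_types": ["int", "mytype"], "return_type": "int"}]

try:
    check_type_unused("mytype", funcs)   # referenced as an argument type
    assert False, "expected the drop to be rejected"
except Exception:
    pass

check_type_unused("othertype", funcs)    # unreferenced type: drop allowed
```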
This partially reverts 49157370bc.
According to the reports in #12173, at least two developers ran into
test failures which are correlated with the latest Seastar change,
which enables the io_uring backend by default. They are using linux
kernel 6.0.12 and 6.1.7. It's also reported that reverting
the commit eedca15f16c3b6eae3d3d8af9510624a93f5d186 in seastar
helps; that very commit enables io_uring by default. Although we
are not able to identify the exact root cause of the failures in #12173
at this moment, ruling out the potential problem of io_uring should
help with further investigation.
In this change, the io_uring backend is disabled when building Seastar.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12689
Add tests which test filtering using IN restriction
with a list which contains a bind variable.
There are other cql-pytest tests which
test IN lists with a bind variable,
but it looks like they don't do filtering.
IN restrictions on primary key columns
are handled in a special way to generate
the right ranges.
These tests hit a different code path as
filtering uses `expr::evaluate()`.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
evaluation_inputs is a struct which contains
data needed to evaluate expressions - values
of columns, bind variables and other data.
is_one_of() is a function used to evaluate
IN restrictions. It checks whether the LHS
is one of the elements of the RHS list.
Generally when evaluating expressions we get
the evaluation_inputs{} as an argument and
we should pass them along to any functions
that evaluate subexpressions.
is_one_of() got the inputs as an argument,
but didn't pass them along to equal();
instead it created a new empty evaluation_inputs{}
and gave that to equal().
At first I thought this was a bug - with missing
information there could be a crash if equal()
tried to evaluate an expression with a bind_variable.
It turns out that in this particular case equal()
won't use the evaluation_inputs{} at all.
The LHS and RHS passed to it are just constant values,
which were already evaluated to serialized bytes
before calling evaluate().
It's still better to pass the inputs argument along
if possible. If in the future equal() required
these inputs for some reason, missing inputs
could lead to an unexpected crash.
I couldn't find any tests that would detect this case,
so such a bug could stay undetected until an unhappy user
finds it because their cluster crashed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
* seastar 943c09f869...ef24279f03 (6):
> Merge 'util/print_safe, reactor: use concept for type constraints and refactory ' from Kefu Chai
> Right align the memory diagnostics
> Merge 'Add an API for the metrics layer to manipulate metrics dynamically.' from Amnon Heiman
> semaphore: assert no outstanding units when moved
> build: do not populate package registry by default
> build: stop detecting concepts support
Closes #12695
There was a check for immediate consistency after a decommission
operation has finished in one of the tests, but it turns out that also
after decommission it might take some time for token ring to be updated
on other nodes. Replace the check with a wait.
Also do the wait in another test that performs a sequence of
decommissions. We won't attempt to start another decommission until
every node learns that the previously decommissioned node has left.
Closes #12686
LCS backlog tracker uses the STCS tracker for L0. Turns out the LCS tracker
was calling the STCS tracker's replace_sstables() with empty arguments
even when only higher levels (> 0) had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.
As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.
Inefficiency is fixed by only updating the STCS tracker if any
L0 sstable is being added or removed from the table.
This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantially higher number of sstables,
therefore updating the STCS tracker only when level 0 changes
significantly reduces the number of times the L0 backlog is recomputed.
Refs #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12676
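The fix amounts to a guard before forwarding to the L0 tracker, which can be modeled as below. Class and function names are illustrative, not the compaction strategy code.

```python
# Sketch: forward replace_sstables() to the L0 (STCS) tracker only when
# the replacement actually touches L0 sstables; replacements confined to
# higher levels no longer trigger an L0 backlog recomputation.
class STCSTracker:
    def __init__(self):
        self.recomputes = 0

    def replace_sstables(self, removed, added):
        self.recomputes += 1  # recomputing the L0 backlog is the costly part


def lcs_replace(stcs, removed, added):
    l0_removed = [s for s in removed if s["level"] == 0]
    l0_added = [s for s in added if s["level"] == 0]
    if l0_removed or l0_added:
        stcs.replace_sstables(l0_removed, l0_added)


stcs = STCSTracker()
lcs_replace(stcs, removed=[{"level": 2}], added=[{"level": 3}])
assert stcs.recomputes == 0   # higher levels only: L0 tracker untouched
lcs_replace(stcs, removed=[{"level": 0}], added=[{"level": 1}])
assert stcs.recomputes == 1   # L0 changed: backlog recomputed once
```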
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
free up space in the pool (using `steal`), stop the cluster, then get a
new cluster from the pool.
Between the `steal` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager`) would start, because there was free
space in the pool.
This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.
Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.
The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)`
and a `destroy` function passed to the pool (now the pool is responsible
for stopping the cluster and releasing its IPs).
Fixes #11757 Closes #12549
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
test/pylib: pool: introduce `replace_dirty`
test/pylib: pool: replace `steal` with `put(is_dirty=True)`
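The key property of `replace_dirty` can be modeled in a few lines. An illustrative sketch only; the real pool in test/pylib manages cluster processes and IPs, which this toy model elides.

```python
# Sketch: replace_dirty returns a dirty cluster and immediately claims
# its pool slot for a fresh one, so the slot is never observably free and
# concurrent runs can never exceed the pool size.
class Pool:
    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def get(self):
        assert self.in_use < self.size, "pool exhausted"
        self.in_use += 1
        return object()   # stand-in for a fresh cluster

    def replace_dirty(self, dirty_cluster):
        # The slot stays claimed: no other run can grab it between
        # "return dirty cluster" and "take fresh cluster".
        return object()


pool = Pool(size=2)
c1, c2 = pool.get(), pool.get()
c1 = pool.replace_dirty(c1)   # still exactly 2 clusters in use
assert pool.in_use == 2
```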
The "cluster manager" used by the topology test suite uses a UNIX-domain
socket to communicate between the cluster manager and the individual tests.
The socket is currently located in the test directory but there is a
problem: In Linux the length of the path used as a UNIX-domain socket
address is limited to just a little over 100 bytes. In Jenkins run, the
test directory names are very long, and we sometimes go over this length
limit and the result is that test.py fails creating this socket.
In this patch we simply put the socket in /tmp instead of the test
directory. We only need to do this change in one place - the cluster
manager, as it already passes the socket path to the individual tests
(using the "--manager-api" option).
Tested by cloning Scylla in a very long directory name.
A test like ./test.py --mode=dev test_concurrent_schema fails before
this patch, and passes with it.
Fixes #12622 Closes #12678
Related https://github.com/scylladb/scylladb/issues/12658.
This fixes a bug in the upgrade guides for the released versions.
Closes #12679
* github.com:scylladb/scylladb:
doc: fix the service name in the upgrade guide for patch releases versions 2022
doc: fix the service name in the upgrade guide from 2021.1 to 2022.1
These fixes address the FTBFS of scylla with GCC-13.
Closes #12669
* github.com:scylladb/scylladb:
cql3/stats: include the used header.
cql3, locator: call fmt::format_to() explicitly
If TOC writing hits TOC file conflict it tries to throw an exception
with sstable generation in it. However, generation_type is not
formattable at all, let alone with the {:d} option.
This bug generates an obscure 'fmt::v9::format_error (invalid type
specifier)' error in unknown location making the debugging hard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12671
Before this change, we constructed an sstring from a comma expression,
which evaluates to the return value of `name.size()`, while what we
expect is `sstring(const char*, size_t)`.
In this change:
* instead of passing only the size of the string_view,
both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
performance, as we don't need to perform a deep copy
the issue is reported by GCC-13:
```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
^~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12666
Since format_to() is defined in both the fmt and std namespaces,
without specifying which one to use, we'd fail to build with a
standard library which implements std::format_to(). Yes, we are
`using namespace std` somewhere.
This change should address the FTBFS with GCC-13.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This series contains the following changes for trimming the ranges passed to cleanup a compaction group to the compaction group owned token_range.
table: compaction_group_for_token: use signed arithmetic
Fixes#12595
table: make_compaction_groups: calculate compaction_group token ranges
table: perform_cleanup_compaction: trim owned ranges on compaction_group boundaries
Fixes#12594
Closes#12598
* github.com:scylladb/scylladb:
table: perform_cleanup_compaction: trim owned ranges on compaction_group boundaries
table: make_compaction_groups: calculate compaction_group token ranges
dht: range_streamer: define logger as static
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space is not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes#12645
Closes#12646
Use the newly introduced key generation facilities, instead of the
old inflexible alternatives and hand-rolled code.
Most of the migrations are mechanical, but there are two tests that
were tricky to migrate:
* sstable_compaction_test.sstable_run_based_compaction_test
* sstable_mutation_test.test_key_count_estimation
These two tests seem to depend on generated keys all being of the same
size. This makes some sense in the case of the key count estimation
test, but makes no sense at all to me in the case of the sstable run
test.
Creating the default schema, used in the default constructor of
table_for_tests. Allows for getting the default schema without creating
an instance first.
Contains methods to generate partition and clustering keys. In the case
of the former, one can specify the shard to generate keys for.
We currently have some methods to generate these but they are not
generic. Therefore the tests are littered with open-coded variants.
The methods introduced here are completely generic: they can generate
keys for any schema.
Allow caller to specify the minimum size in bytes of the generated
value. Only really works with string-like types (and collections of
these).
Also fixed max size enforcement for strings: before this patch, the
provided max size was divided by the wide string size, instead of the
char width of the actual string type the value is generated for.
Since we're potentially searching the row_lock in parallel to acquiring the read_lock on the partition, we're racing with row_locker::unlock that may erase the _row_locks entry for the same clustering key, since there is no lock to protect it up until the partition lock has been acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock _after_ the partition lock has been acquired to make sure we're synchronously starting the read/write lock function on it, without yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance that wasn't needed before if the lock for it was already found, but the view building is not on the hot path so we can tolerate that.
This is required on top of 5007ded2c1 as seen in https://github.com/scylladb/scylladb/issues/12632 which is closely related to #12168 but demonstrates a different race causing use-after-free.
Fixes#12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12639
* github.com:scylladb/scylladb:
view: row_lock: lock_ck: try_emplace row_lock entry
view: row_lock: lock_ck: find or construct row_lock under partition lock
Will be used by MVCC tests which don't want (can't) deal with the
row_cache as the container but work with the partition_entry directly.
Currently, rows_entry::on_evicted() assumes that it's embedded in
row_cache and would segfault when trying to evict the containing
partition entry which is not embedded in row_cache. The solution is to
call evict_shallow() from mvcc_tests, which does not attempt to evict
the containing partition_entry.
For non-evictable snapshots all ranges are continuous so there is no
need to apply the continuity flag to the previous interval if the
source mutation has the interval marked as continuous.
Without this, applying a single row mutation to a memtable would
involve scanning the existing version for the range before the row's
key. This makes population quadratic.
This is worsened by the fact that this scan will happen in the
background if preempted, which exposes a scheduling problem. The
mutation cleaner worker which merges versions in the background will
not keep up with the incoming writes. This will lead to explosion of
partition versions, which makes reads (e.g. memtable flush) very
slow. The read will have to refresh the iterator heap, which has an
iterator for each version, across every preemption point, because
cleaning invalidates iterators.
The same could happen before the v2 representation, but for much less
typical workloads, e.g. applying lots of mutations with a single range
tombstone covering existing rows.
The problem was hit in index_with_paging_test in debug mode. It's less
likely to happen in release mode where preemption is not triggered as
often.
The test measured copying of the mutation object, but verified the
measurement against mutation_partition::external_memory_usage(). So
anything allocated on the mutation object level would cause the test
to (incorrectly) fail. Fix that by copying only the mutation_partition
part.
Currently not a problem, because the partition_key is stored in the
in-line storage. Would become a problem once inline storage is
reduced.
It's not obvious that invariants for partial merge do not hold for a
completed merge.
This is due to the fact that an empty source partition, which is
always empty after merge, is always fully continuous.
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.
The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.
Range tombstone eviction in cache has now fine granularity, like with
rows.
Fixes#2578
Fixes#3288
Fixes#10587
This is a prerequisite for using the cursor in memtable readers.
Non-evictable snapshots are those which live in memtables. Unlike
evictable snapshots, they don't have a dummy entry at position after
all clustering rows. In evictable snapshots, lookup always finds an
entry, not so with non-evictable snapshots. The cursor was not
prepared for this case, this patch handles it.
This patch changes mutation_partition_v2 to store range tombstone
information together with rows.
This mainly affects the version merging algorithm,
mutation_partition_v2::apply_monotonically().
Continuity setting no longer can drop dummy entry unconditionally
since it may be a boundary of a range tombstone.
Memtable/cache is not switched yet.
Refs #10587
Refs #3288
Intended to be used in memtable/cache, as opposed to the old
mutation_partition which will be intended to be used as a temporary
object.
The two will have different trade-offs regarding memory efficiency and
algorithms.
In this commit there is no change in logic, the class is mostly
copied. Some methods which are not needed on the v2 model were removed
from the interface.
Logic changes will be introduced in later commits.
This extracts information which was there in row_cache.md, but is
relevant to MVCC in general.
It also makes adaptations and reflects the upcoming changes in this
series related to switching to the new mutation_partition_v2 model:
- continuity in evictable snapshots can now overlap. This is needed
to represent range tombstone information, which is linked to
continuity information.
- description of range tombstone representation was added
It's the standard now which replaced bound_view.
Will be consistent with how range tombstone bounds are represented in
mutation_partition_v2 (as rows_entry::position()).
Partition version merging can now insert sentinels, which may
temporarily increase unspooled memory. It is no longer true that
unspooled monotonically decreases, which the test verified. Relax it,
and only verify that unspooled is smaller than real dirty.
api::new_timestamp() is not monotonic. In
test_single_row_and_tombstone_not_cached_single_row_range1, we
generate a deletion and an insertion in the deleted range. If they
get the same timestamp, the inserted row will be covered.
This will surface after cache starts to compact rows with range tombstones.
When inserting range tombstones, the test uses api::new_timestamp(),
but when inserting rows, it uses a fixed timestamp of 1. This will be
problematic when rows get compacted with range tombstone, all rows
would get compacted away, which is not expected by the test. To fix
this, let's use the same timestamp source as range tombstones. This
way rows will get a later timestamp.
compact_for_compaction() will perform cell expiration based on
gc_clock::now(), which introduces sporadic mismatches due to expiry
status of a row marker.
Drop this, we can rely on compaction done by is_equal_to_compacted()
This tombstone has a high chance of obliterating all data, which will
make tests which involve partition version merging not very
interesting. The result will be an empty partition with a
tombstone. Reduce its frequency, so that in MVCC there is a
significant chance of having live data in the combined entry where
individual versions come from the generator.
evictable snapshots must have all entries added to the
tracker. Partition version merging assumes this. Before this was
benign, but will start to trigger asserts in mutation_partition_v2.
This will trigger an assert in apply_monotonically() later in the
series, where this row would be merged with a dummy at the same
position. This row must not be marked as non-dummy, there is an
assumption that non-clustering positions are all dummies. There can't
be two entries with the same position and a different dummy status.
Upon entry to merge_partition_versions() we skip over versions which
are not referenced in order to start merging from the oldest
unreferenced version, which is good for performance. Later, we
reallocate version merging state if we detected such a move, so that
we don't reuse state allocated for a different version pair than
before. This check was using version_no, the counter of skipped
versions to detect this. But this only makes sense if each
merge_partition_versions() uses the same version pointer as a base. In
fact it doesn't, if we skip, we advance _version, so the skip is
persisted in the snapshot. It's enough to discard the version merging
state when we do that.
This shouldn't have effect on existing code base, since there is
currently no way to trigger the version skipping logic.
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause a false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, and even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
Fixes: #11905
Refs: #11836
Closes#11860
Currently all consumed range tombstone changes are unconditionally
forwarded to the validator, even if they are shadowed by a higher level
tombstone and/or are purgeable. This can result in a situation where a range
tombstone change was seen by the validator but not passed to the
consumer. The validator expects the range tombstone change to be closed
by end-of-partition but the end fragment won't come as the tombstone was
dropped, resulting in a false-positive validation failure.
Fix by only passing tombstones to the validator that are actually
passed to the consumer too.
Fixes: #12575
Closes#12578
The docs/alternator/compatibility.md file links to various open issues
on unimplemented features. One of the links was to an already-closed
issue. Replace it by a link to an open issue that was missing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12649
There's another one that accepts explicit basedir first argument and
that's used by the rest of the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12643
We had one test test_gsi.py::test_gsi_identical that didn't work on KA/LA
sstables due to #6157, so it was skipped. Today, Scylla no longer supports
writing these old sstable formats, so the test can never find itself
running on these versions, so should pass. And indeed it does, and the
"skip" marker can be removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12651
Today's sstable_test_env starts with a default-configured db::config and, thus, sstables_manager. Test cases that run in this env always create a tempdir to store sstable files in on their own. Next patching makes sstable-manager and friends fully control the data-dir path in order to support object storage for sstables in a nice way, and this behavior of tests upsets this ongoing work.
That said, this PR configures sstable_test_env with a tempdir and pins down the cases using it to stick to that directory, rather than to the custom one.
Closes#12641
* github.com:scylladb/scylladb:
test: Use tempdir from sstable_test_env
test: Add tmpdir to sstable test env
test: Keep db::config as unique pointer
The leak sanitizer has a bug [1] where, if it detects a leak, it
forks something, and before that, it closes all files (instead of
using close_range like a good citizen).
Docker tends to create containers with the NOFILE limit (number of
open files) set to 1 billion.
The resulting 1 billion close() system calls is incredibly slow.
Work around that problem by passing the host NOFILE limit.
[1] https://github.com/llvm/llvm-project/issues/59112
Closes#12638
Use same method as the two-level lock at the
partition level. try_emplace will either use
an existing entry, if found, or create a new
entry otherwise.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes#12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After topology changes like removing a node, verify that the set of
group 0 members and token ring members is the same.
Modify `get_token_ring_host_ids` to only return NORMAL members. The
previous version which used the `/storage_service/host_id` endpoint
might have returned non-NORMAL members as well.
Fixes: #12153
Closes#12619
* seastar d41af8b59...943c09f86 (20):
> reactor: disable io_uring on older kernels if not enough lockable memory is available
> demos/tcp_sctp_client_demo: use user-defined literal for sizes
> core/units: add user-defined literal for IEC prefixes
> core/units: include what we use
> coroutine/exception: do not include core/coroutine.hh
> seastar/coroutine: drop std-coroutine.hh
> core/bitops.hh: add type constraits to templates
> apps/iotune: s/condition == false/!condition/
> core/metrics_api: s/promehteus/prometheus/
> reactor: make io_uring the default backend if available
> tests: connect_test: use 127.0.0.1 for connect refused test
> reactor: use aio to implement reactor_backend_uring::read()
> future: schedule: get_available_state_ref under SEASTAR_DEBUG
> rpc: client_info: add retrieve_auxiliary_opt
> Merge 'Make http requests with content-length header and generated body' from Pavel Emelyanov
> Merge 'Ensure logger doesn't allocate' from Travis Downs
> http, httpd: optimize header field assignment
> sstring: operator<< std::unordered_map: delete stray space char
> Dump memory diagnostics at error level on abort
> Fix CLI help for memory diagnostics dump
Closes#12650
There are several helpers to make an sstable for the table, and the two with the most arguments are only used by tests. This PR leaves table with just one arg-less call, thus making it easier to patch further.
Closes#12636
* github.com:scylladb/scylladb:
table: Shrink sstables making API
tests: Use sstables manager to make sstables
distributed_loader: Add helpers to make sstables for reshape/reshard
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.
Fixes#12218
Closes#12630
* github.com:scylladb/scylladb:
pytest: test start after ungraceful stop
test.py: enable force_schema_commit_log
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
put the old cluster to the pool with `is_dirty=True`, then get a new
cluster from the pool.
Between the `put` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager`) would start, because there was free
space in the pool.
This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.
Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.
Fixes#11757
Used to atomically return a dirty object to the pool and then use the
space freed by this object to get another object. Unlike
`put(is_dirty=True)` followed by `get`, a concurrent waiter cannot take
away our space from us.
A piece of `get` was refactored to a private function `_build_and_get`,
this piece is also used in `replace_dirty`.
The pool usage was kind of awkward previously: if the user of a pool
decided that a previously borrowed object should no longer be used,
it was their responsibility to destroy the object (releasing associated
resources and so on) and then call `steal()` on the pool to free space
for a new object.
Change the interface. Now the `Pool` constructor obtains a `destroy`
function additionally to the `build` function. The user calls the
function `put` to return both objects that are still usable and those
aren't. For the latter, they set `is_dirty=True`. The pool will
'destroy' the object with the provided function, which could mean e.g.
releasing associated resources.
For example, instead of:
```
if self.cluster.is_dirty:
self.clusters.stop()
self.clusters.release_ips()
self.clusters.steal()
else:
self.clusters.put(self.cluster)
```
we can now use:
```
self.clusters.put(self.cluster, is_dirty=self.cluster.is_dirty)
```
(assuming that `self.clusters` is a pool constructed with a `destroy`
function that stops the cluster and releases its IPs.)
Also extend the interface of the context manager obtained by
`instance()` - the user must now pass a flag `dirty_on_exception`. If
the context manager exits due to an exception and that flag was `True`,
the object will be considered dirty. The dirty flag can also be set
manually on the context manager. For example:
```
async with (cm := pool.instance(dirty_on_exception=True)) as server:
cm.dirty = await run_test(test, server)
# It will also be considered dirty if run_test throws an exception
```
The test cases in sstable_directory_test use a temporary directory that
differs from the one sstables manager starts over. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This adds the test/lib's tmpdir instance _and_ configures the
data_file_directories with this path. This makes sure sstables manager
and the rest of the test use the same directory for sstables. For now
it doesn't change anything, but helps next patching.
(A neat side effect of this change is that sstable_test_env is now
configured the same way as cql_test_env does)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently there are four helpers, this patch makes it just two and one
of them becomes private the table thus making the API small and neat
(and easy to patch further).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This test uses two many-args helpers from the table class to create sstables
with desired parameters. The table API in question is not used by any
other code but these few places, so it's better to open-code it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This kills two birds with one stone. First, it factors out (quite a lot
of) common arguments that are passed to table.make_sstable(). Second, it
makes the helpers call sstable manager with extended args making it
possible to remove those wrappers from table class later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Although the number of keyspaces should mostly be 1 here, and thus the
chance of two tables from different keyspaces colliding is miniscule, it
is not zero. Better be safe than sorry, so match the keyspace name too
when looking up a table.
Closes#12627
Each time backlog tracker is informed about a new or old sstable, it
will recompute the static part of the backlog, whose complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.
Fixes#12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12593
This phrase is inaccurate and unnecessary. We know all lines in the
printout are for reads and they are semaphores: no need to repeat this
information on each line.
Example:
Read Concurrency Semaphores:
read: 0/100, 0/ 41901096, queued: 0
streaming: 0/ 10, 0/ 41901096, queued: 0
system: 0/ 10, 0/ 41901096, queued: 0
Closes#12633
The dbuild README has an example how to enable ccache, and required
modifying the PATH. Since recently, our docker image includes
required commands (cxxbridge) in /usr/local/bin, so the build will
fail if that directory isn't also in the path - so add it in the
example.
Also use the opportunity to fix the "/home/nyh" in one example to
"$HOME".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12631
The goal is to make it possible to make config with custom-initialized
options in test_env::impl's constructor initializer list (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`ScyllaCluster.server_stop` had this piece of code:
```
server = self.running.pop(server_id)
if gracefully:
await server.stop_gracefully()
else:
await server.stop()
self.stopped[server_id] = server
```
We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).
Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.
Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when clearing after the cluster; and we don't leak it.
Closes#12613
Test case for starting a server after it was stopped suddenly (instead
of gracefully). This could cause commitlog flush issues.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster.
People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade.
Also update the docs: cluster creation and node addition procedures.
Fixes#12572.
Closes#12585
* github.com:scylladb/scylladb:
docs: mention `consistent_cluster_management` for creating cluster and adding node procedures
conf: enable `consistent_cluster_management` by default
Its _it member keeps state about the current range.
Although it's modified by the method, this is an implementation
detail that is irrelevant to the caller, hence mark the
belongs_to_current_node method as const (and noexcept while
at it).
This allows the caller, cleanup_compaction, to use it from
inside a const method, without having to mark
its respective member as mutable too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12634
change the row purge condition for compacting_reader to remove all
expired rows, avoiding read performance problems when there are many
expired tombstones in the row cache
Refs #2252
Closes#12565
From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid
using `async with` and access the `Pool` of clusters with
`get()`/`put()`.
Closes#12612
* github.com:scylladb/scylladb:
test.py: manual cluster handling for PythonSuite
test.py: stop cluster if PythonSuite fails to start
test.py: minor fix for failed PythonSuite test
- makes all regexes static
If making regex compilation static
for uuid_type_impl and timeuuid_type_impl helps then it should
also help for timestamp_type and simple_date_type.
- remove unnecessary tolower transform in simple_date_type_impl::from_sstring
The following function uses only decimal and '-' characters (see date_re),
which are not affected by the tolower call in any way.
Additionally std::strtoll supports "0x" prefixes but also accepts
upper case version "0X" so it's also not affected by tolower call.
get_simple_date_time only casts strings to integer types using
boost::lexical_cast so also not affected by tolower.
Finally, serialize only uses str to include it in an exception text
so tolower doesn't affect it in a positive way. It's even better
that input is displayed to the user as it was, not converted to lower
case.
Closes#12621
* github.com:scylladb/scylladb:
types: remove unnecessary tolower transform in simple_date_type_impl::from_sstring
types: make all regexes static
Most snitch drivers set _my_dc and _my_rack with direct assignment,
thus skipping the sanity checks for dc/rack being empty. On other shards
they call the set_my_dc_and_rack() helper, which warns about an empty
value and replaces it with some defaults.
It's better to use the helper on all shards in order to have the same
dc/rack values everywhere.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12524
Issue #12538 suggested that maybe Alternator shouldn't bother reporting an
invalid table name in item operations like PutItem, and that it's enough
to report that the table doesn't exist. But the test added in this patch
shows that DynamoDB, like Alternator, reports the invalid table name in
this case - not just that the table doesn't exist.
That should make us think twice before acting on issue #12538. If we do
what this issue recommended, this test will need to be fixed (e.g., to
accept as correct both types of errors).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12608
before this change, we returned the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflects
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.
after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.
Fixes#12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12573
DynamoDB Streams limits the "Limit" parameter of ListStreams to 100 -
anything larger will result in an error. Scylla doesn't necessarily
need to uphold the same limit, but we should uphold *some* limit, as
not having any limit can result (in the theoretical case of a huge
number of tables with streams enabled) in an unbounded response size.
So here we add a test to check that a Limit of 100,000 is not allowed.
It passes on DynamoDB (in fact, any number higher than 100 will be
enough there) but fails on Alternator, so is marked "xfail".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We had a skipped test on how Alternator handles Limit=0 for ListStreams
which should be reported as an error. We had to skip it because boto3
did us a "favor" of discovering this parameter error before ever sending
it to the server. We discovered long ago how to avoid this client-side
checking in boto3, but only used it for the "dynamodb" fixture and
forgot to copy the same trick to the "dynamodbstreams" fixture - and
in this patch we do, and can run this test successfully.
While at it, also copy the extended timeout configuration we had in
the dynamodb fixture also to the dynamodbstreams fixture. There is
no reason why it should be different.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Instead of complex async with logic, use manual cluster pool handling.
Revert the discard() logic in Pool from a recent commit.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Even though the test can't fail both before and after, make the logic
explicit in case the code changes in the future.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
add a Make variable named `PREVIEW_HOST` so it can be overridden like
```
make preview PREVIEW_HOST=$(hostname -I | cut -d' ' -f 1)
```
it allows a developer to preview the document when the host building
the document is not localhost.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12589
It's a simple helper used during boot-time that can enjoy query-processor from sharded<system_keyspace>
Closes #12587
* github.com:scylladb/scylladb:
system_keyspace: De-static system_keyspace::increment_and_get_generation
system_keyspace: Fix indentation after previous patch
system_keyspace: Coroutinize system_keyspace::increment_and_get_generation
The following functions use only decimal and '-' characters (see date_re),
so they are not affected by the tolower call in any way.
Additionally, std::strtoll supports "0x" prefixes but also accepts
the upper-case version "0X", so it is also not affected by tolower.
get_simple_date_time only casts strings to integer types using
boost::lexical_cast, so it is also unaffected by tolower.
Finally, serialize only uses str to include it in an exception text,
so tolower doesn't affect it in a positive way. It's even better
that the input is displayed to the user as it was, not converted to
lower case.
The reasons for force-disabling are doubly wrong: we now
use liburing from Fedora 37, which is sufficiently recent,
and the auto-detection code will disable io_uring if a
sufficiently recent version isn't present.
Closes #12620
There was a small chance that we called `timeout_src.request_abort()`
twice in the `with_timeout` function, first by timeout and then by
shutdown. `abort_source` fails on an assertion in this case. Fix this.
Fixes: #12512
Closes #12514
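A minimal sketch of the race and the guard, using a toy stand-in rather than Seastar's actual `abort_source` API: the fix amounts to making sure only one of the two paths (timeout, shutdown) actually calls `request_abort()`.

```python
class AbortSource:
    """Toy stand-in for seastar::abort_source: asserts on double abort."""
    def __init__(self):
        self.aborted = False

    def request_abort(self):
        assert not self.aborted, "request_abort() called twice"
        self.aborted = True

def abort_once(src):
    # Guard shared by the timeout and the shutdown path: whichever
    # runs second becomes a no-op instead of tripping the assertion.
    if not src.aborted:
        src.request_abort()
```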
Fix https://github.com/scylladb/scylla-docs/issues/3968
This PR adds the information that an upgrade to each successive major version is required to upgrade from an old ScyllaDB version.
Closes #12586
* github.com:scylladb/scylladb:
docs: remove repetition
doc: add the general upgrade policy to the upgrade page
Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability.
Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection.
Closes #12588
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: better logging for timeout on server startup
test/pylib: scylla_cluster: use less expensive query to check for CQL availability
Waiting for server startup is a multi-step procedure: after we start the
actual process, we will:
- try to obtain the Host ID (by querying a REST API endpoint)
- then try to connect a CQL session
- then try to perform a CQL query
The steps are repeated every 0.1 seconds until we reach a timeout (the
Host ID step is skipped if we previously managed to obtain it).
On timeout we'd only get a generic "failed to start server" message;
it wouldn't say which steps succeeded and which didn't.
For example, on one of the failed jobs on Jenkins I observed this
timeout error. Looking at the logs of the server, it turned out that the
server printed the "initialization completed" message more than 2
minutes before the actual timeout happened. So for 2 minutes, the test
framework either couldn't obtain the Host ID, or couldn't establish a
CQL connection, or couldn't perform a CQL query, but I wasn't able to
determine fully which one of these was the case.
Improve the code by printing whether we managed to get the Host ID of
the server and if so - whether we managed to connect to CQL.
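The improved wait loop can be sketched like this; the probe helpers are hypothetical placeholders for the REST and CQL checks described above:

```python
import time

def wait_for_startup(get_host_id, connect_cql, query_cql,
                     timeout, poll_interval=0.1, clock=time.monotonic):
    """Retry the startup probes until success or timeout; on timeout,
    report how far startup got instead of a generic failure message."""
    host_id = None
    cql_connected = False
    deadline = clock() + timeout
    while clock() < deadline:
        if host_id is None:
            host_id = get_host_id()             # REST API probe; cached once obtained
        if host_id is not None:
            if not cql_connected:
                cql_connected = connect_cql()   # CQL session probe
            if cql_connected and query_cql():   # cheap CQL query probe
                return host_id
        time.sleep(poll_interval)
    raise TimeoutError(f"failed to start server: "
                       f"host_id={host_id}, cql_connected={cql_connected}")
```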
Fix https://github.com/scylladb/scylladb/issues/12605#event-8328930604
This PR removes the duplicated content (the file with the table was included twice) and reorganizes the content in the Networking section.
Closes #12615
* github.com:scylladb/scylladb:
doc: fix the broken link
doc: replace Scylla with ScyllaDB
doc: remove duplication in the Networking section (the table of ports used by ScyllaDB)
Fixes #12601 (maybe?)
Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
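The duplicate-free paging that sorting buys can be sketched like this (names hypothetical, not the actual Alternator code):

```python
def list_tables_page(table_ids, exclusive_start=None, limit=2):
    """Return one page of IDs in sorted order, plus the key to resume
    from. Sorting guarantees a paged listing never repeats an ID,
    though a table created mid-listing with a "smaller" ID may be
    missed, which is expected."""
    ids = sorted(i for i in table_ids
                 if exclusive_start is None or i > exclusive_start)
    page = ids[:limit]
    last_evaluated = page[-1] if len(ids) > limit else None
    return page, last_evaluated
```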
To clean up tokens in sstables that are not owned
by the compaction group. This may happen in the future,
after a compaction group split, if we copy / link
the sstables from the original compaction_group to
the split compaction_groups.
Fixes #12594
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add dht::split_token_range_msb that returns a token_range_vector
with ranges split using a given number of most-significant bits.
When creating the table's compaction groups, use dht::split_token_range_msb
to calculate the token_range owned by each compaction_group.
Refs #12594
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
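A sketch of the splitting scheme in plain Python (Scylla's tokens are signed 64-bit values): splitting on the `msb` most significant bits yields `2**msb` contiguous, equally sized subranges covering the whole token range.

```python
MIN_TOKEN = -(1 << 63)  # smallest signed 64-bit token

def split_token_range_msb(msb):
    """Split the full signed 64-bit token range into 2**msb
    equally sized, contiguous (start, end) subranges."""
    n = 1 << msb
    span = (1 << 64) // n        # width of each subrange
    return [(MIN_TOKEN + i * span, MIN_TOKEN + (i + 1) * span - 1)
            for i in range(n)]
```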
dht::logger can't be global in this case,
as it's too generic, but should be static
to range_streamer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the after-test check fails (is_after_test_ok is False), discard the cluster and raise an exception so the context manager (pool) does not recycle it.
Ignore the exception re-raised by the context manager.
Fixes #12360
Closes #12569
* github.com:scylladb/scylladb:
test.py: handle broken clusters for Python suite
test.py: Pool discard method
Add and use dht::compaction_group_of that computes the
compaction_group index by unbiasing the token,
similar to dht::shard_of.
This way, all tokens in `_compaction_groups[i]` are ordered
before `_compaction_groups[j]` iff i < j.
Fixes #12595
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12599
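The unbiasing computation can be sketched like this, mirroring the idea of dht::shard_of rather than its exact code:

```python
def compaction_group_of(token, n_groups):
    """Map a signed 64-bit token to a group index by unbiasing it
    into [0, 2**64) and scaling into [0, n_groups). Because the map
    is monotonic, every token in group i orders before every token
    in group j whenever i < j."""
    unbiased = token + (1 << 63)
    return (unbiased * n_groups) >> 64
```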
It's only called on cluster-join from storage_service which has the
local system_keyspace reference and it's already started by that time.
This allows removing a few more occurrences of the global qctx.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just unroll the fn().then({ fn2().then().then(); }); chain.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently reversed types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema which has a reversed<frozen<UDT>> clustering key component
will have this component incorrectly represented in the schema CQL dump:
the UDT will lose the frozen attribute. When attempting to recreate
the schema based on the dump, it will fail, as only frozen UDTs are
allowed in primary key components.
Fixes: #12576
Closes #12579
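A toy model of the fix, not Scylla's actual type hierarchy: the tuple-check must look through the `reversed<>` wrapper instead of letting reversed types hit the default (false) case.

```python
class Type:
    """Minimal stand-in for a CQL abstract type."""
    def __init__(self, name, underlying=None, is_tuple=False):
        self.name = name              # e.g. "int", "udt", "reversed"
        self.underlying = underlying  # wrapped type for reversed<...>
        self.is_tuple = is_tuple      # tuples/UDTs have tuple nature

def contains_tuple(t):
    # Unwrap any number of reversed<...> layers before answering,
    # rather than returning the default (False) for reversed types.
    while t.name == "reversed":
        t = t.underlying
    return t.is_tuple

frozen_udt = Type("udt", is_tuple=True)
reversed_frozen_udt = Type("reversed", underlying=frozen_udt)
```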
This test creates random dummy reads and simulates a query with them.
The test works in terms of iterations (ticks), advancing each simulated
read in each iteration. To prevent infinite runtime an iteration limit
of 100 was added to detect a non-converging test and kill it. This limit
proved too strict however and in this patch we bump it to 1000 to
prevent some unlucky seed making this test fail, as seen recently in CI.
Closes #12580
We noticed that old branches of Scylla had problems with looking up a
null value in a local secondary index - hanging or crashing. This patch
includes tests to reproduce these bugs. The tests pass on current
master - apparently this bug has already been fixed, but we didn't
have a regression test for it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12570
If the after-test check fails (!is_after_test_ok), discard the cluster
and raise an exception so the context manager (pool) does not recycle it.
Ignore the Pool exception re-raised by the context manager.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This PR fixes three problems that prevented/could prevent a successful build in ScyllaDB's Nix development environment.
The first commit adds a missing `abseil-cpp` dependency to Nix devenv, as this dependency is now required after 8635d2442.
The second commit bumps the version of Lua from 5.3 to 5.4, as after 9dd5107919 a 4-argument version of `lua_resume` (only available in Lua 5.4) is used in the ScyllaDB codebase.
The third commit explicitly adds `rustc` to Nix devenv dependencies. This places `rustc` from nixpkgs on the `PATH`, preventing `cargo` from executing `rustc` installed globally on the system (see the commit message for additional reasoning).
After those changes, ScyllaDB can be successfully built in both `nix-shell .` and `nix develop .` environments.
Closes #12568
* github.com:scylladb/scylladb:
build: explicitly add rustc to Nix devenv
build: bump Lua version (5.3 -> 5.4) in Nix devenv
build: add abseil-cpp dependency to Nix devenv
Sets the current schema to be used by schema-aware commands.
Setting the schema allows some commands and printers to interpret
schema-dependent objects and present them in a more friendly form.
Some commands require schema to work, for example to sort keys, and
will fail otherwise.
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.
Do it by wrapping every handler into a catch clause that returns
the exception message.
Also modify a bit how HTTPErrors are rendered so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request etc.)
Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```
After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```
Closes #12563
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IP after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.
Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next guy waiting on the pool.
Closes #12564
Before this patch, "cargo" was the only Rust toolchain dependency in the Nix
development environment. Due to the way the "cargo" tool is packaged in Nix,
"cargo" would first try to use "rustc" from PATH (for example, some
version already installed globally on the OS). If it didn't find any, it
would fall back to "rustc" from nixpkgs.
There are issues with such an approach:
- "rustc" installed globally on the system could be old.
- the goal of having a Nix development environment is that such
environment is separate from the programs installed globally on the
system and the versions of all tools are pinned (via flake.lock).
Fix this problem by adding rustc to nativeBuildInputs in default.nix.
After this patch, "rustc" from nixpkgs is present on the PATH
(potentially overriding "rustc" already installed on the system), so
"cargo" can correctly use it.
You can validate this behavior experimentally by adding a fake failing
rustc before entering the Nix development environment:
```
mkdir fakerustc
echo '#!/bin/bash' >> fakerustc/rustc
echo 'exit 1' >> fakerustc/rustc
chmod +x fakerustc/rustc
export PATH=$(pwd)/fakerustc:$PATH
nix-shell .
```
A recent commit (9dd5107919) started using a 4-argument version of
lua_resume, which is only available in Lua 5.4. This caused build
problems when trying to build Scylla in Nix development environment:
```
tools/lua_sstable_consumer.cc:1292:19: error: no matching function for call to 'lua_resume'
    ret = lua_resume(l, nullptr, nargs, &nresults);
          ^~~~~~~~~~
/nix/store/wiz3xb19x2pv7j3hf29rbafm4s5zp2kx-lua-5.3.6/include/lua.h:290:15: note: candidate function not viable: requires 3 arguments, but 4 were provided
LUA_API int (lua_resume) (lua_State *L, lua_State *from, int narg);
            ^
1 error generated.
```
Fix the problem by bumping the version of Lua from 5.3 to 5.4 in
default.nix. Since "lua54Packages.lua" was added to nixpkgs fairly
recently (NixOS/nixpkgs#207862), flake.lock is updated to get the newest
version of nixpkgs (updated using "nix flake update" command).
The main assumption here is that if is_big is good enough for
the BatchGetItem operation, it should work well also for Scan,
Query and GetRecords. And it's easier to maintain more unified
code.
Additionally, the 'future<> print' documentation used for streaming
suggests that there is quite a big overhead, so since it seems the
only motivation for streaming was to reduce the contiguous allocation
size below some threshold, we should not stream when this threshold
is not exceeded.
Closes #12164
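The resulting decision can be sketched as follows; the threshold value is hypothetical, chosen only to illustrate the shape of the check:

```python
BIG_RESPONSE_THRESHOLD = 16 * 1024   # hypothetical cutoff, in bytes

def is_big(estimated_size):
    return estimated_size > BIG_RESPONSE_THRESHOLD

def response_write_mode(estimated_size):
    # Stream only when the response would otherwise need a large
    # contiguous allocation; small responses skip the streaming
    # overhead entirely.
    return "streamed" if is_big(estimated_size) else "contiguous"
```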
This PR is not related to any reported issue in the repo.
I've just discovered a broken link in the university caused by a
missing redirection.
Closes #12567
After 8635d2442 commit, the abseil submodule was removed in favor of
using pre-built abseil distribution. Installation of abseil-cpp was
added to install-dependencies.sh and dbuild image, but no change was
made to the Nix development environment, which resulted in error
while executing ./configure.py (while in Nix devenv):
Package absl_raw_hash_set was not found in the pkg-config search path.
Perhaps you should add the directory containing `absl_raw_hash_set.pc'
to the PKG_CONFIG_PATH environment variable
No package 'absl_raw_hash_set' found
Fix the issue by adding "abseil-cpp" to buildInputs in default.nix.
Recently, commit 0b418fa made the checking for "unset" values more
centralized and more robust, and as the tests added in this patch
show, the situation is good (in particular, #10358 is
solved).
The tests in this patch check that the behavior of "unset" values in
the CQL v4 protocol matches Cassandra's behavior and its documentation,
and how it compares to our wishes of how we want unset values to behave.
One of these tests fails on Cassandra (we consider this a Cassandra bug).
One test fails on Scylla because it doesn't yet support arithmetic
expressions (Refs #2693).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12534
The CQL protocol and specification call for lists with NULLs in
some places. For example, the statement:
```cql
UPDATE tab
SET x = 3
IF y IN (1, 2, NULL)
WHERE pk = 4
```
has a list `(1, 2, NULL)` that contains NULL. Although the syntax is tuple-like, the value is a list;
consider the same statement as a prepared statement:
```cql
UPDATE tab
SET x = :x
IF y IN :y_values
WHERE pk = :pk
```
`:y_values` must have a list type, since the number of elements is unknown.
Currently, this is done with special paths inside LWT that bypass normal
evaluation, but if we want to unify those paths, we must allow NULLs in
lists (except in storage). This series does that.
Closes #12411
* github.com:scylladb/scylladb:
test: materialized view: add test exercising synthetic empty-type columns
cql3: expr: relax evaluate_list() to allow NULL elements
types: allow lists with NULL
test: relax NULL check test predicate
cql3, types: validate listlike collections (sets, lists) for storage
types: make empty type deserialize to non-null value
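For illustration, the native-protocol convention this series relies on length-prefixes each list element, and a length of -1 marks a NULL element; a minimal sketch for a list of 32-bit ints:

```python
import struct

def serialize_int_list(values):
    """[count][len|-1][bytes]... ; a length of -1 encodes NULL,
    so a list value like (1, 2, NULL) is representable."""
    out = [struct.pack(">i", len(values))]
    for v in values:
        if v is None:
            out.append(struct.pack(">i", -1))            # NULL element
        else:
            out.append(struct.pack(">i", 4) + struct.pack(">i", v))
    return b"".join(out)
```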
following tests are integrated into scylla executable
- perf_fast_forward
- perf_row_cache_update
- perf_simple_query
- perf_sstable
before this change
```console
$ size build/release/scylla
text data bss dec hex filename
82284664 288960 335897 82909521 4f11951 build/release/scylla
$ ls -l build/release/scylla
-rwxrwxr-x 1 kefu kefu 1719672112 Jan 19 17:51 build/release/scylla
```
after this change
```console
$ size build/release/scylla
text data bss dec hex filename
84349449 289424 345257 84984130 510c142 build/release/scylla
$ ls -l build/release/scylla
-rwxrwxr-x 1 kefu kefu 1774204800 Jan 19 17:52 build/release/scylla
```
Fixes #12484
Closes #12558
* github.com:scylladb/scylladb:
main: move perf_sstable into scylla
main: move perf_row_cache_update into scylla
test: perf_row_cache_update: add static specifier to local functions
main: move perf_fast_forward into scylla
main: move perf_simple_query into scylla
test: extract debug::the_database out
main: shift the args when checking exec_name
main: extract lookup_main_func() out
If a cluster fails to boot, it saves the exception in
`self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
def before_test(self, name) -> None:
"""Check that the cluster is ready for a test. If
there was a start error, throw it here - the server is
running when it's added to the pool, which can't be attributed
to any specific test, throwing it here would stop a specific
test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.
Prevent this by marking the cluster as dirty the first time we rethrow
the exception.
Closes #12560
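The fix can be sketched as a small addition to `before_test` (a toy model, not the actual test.py code):

```python
class ScyllaCluster:
    """Toy model of a pooled cluster that may have failed to boot."""
    def __init__(self, start_exception=None):
        self.start_exception = start_exception
        self.is_dirty = False       # dirty clusters are not recycled

    def before_test(self, name):
        if self.start_exception is not None:
            # Mark dirty the first time we rethrow, so the pool
            # replaces this cluster instead of handing it out again.
            self.is_dirty = True
            raise RuntimeError(
                f"cluster failed to start: {self.start_exception}")
```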
This decreases the whole alternator::get_table cpu time by 78%
(from 2.8 us to 0.6 us on my cpu).
In perf_simple_query it decreases allocs/op by 1.6% (by removing 4 allocations)
and increases median tps by 3.4%.
Raw results from running:
```console
./build/release/test/perf/perf_simple_query_g --smp 1 \
    --alternator forbid --default-log-level error \
    --random-seed=1235000092 --duration=180 --write
```
Before the patch:
median 46903.65 tps (197.2 allocs/op, 12.1 tasks/op, 170886 insns/op, 0 errors)
median absolute deviation: 210.15
maximum: 47354.59
minimum: 42535.63
After the patch:
median 48484.76 tps (194.1 allocs/op, 12.1 tasks/op, 168512 insns/op, 0 errors)
median absolute deviation: 317.32
maximum: 49247.69
minimum: 44656.38
Closes #12445
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).
However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.
Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.
Closes #12542
* configure.py:
- include `test/perf/perf_sstable` and its dependencies in scylla_perfs
* test/perf/perf_sstable.cc: change `main()` to
`perf::scylla_sstable_main()`
* test/perf/entry_point.hh: add
`perf::scylla_sstable_main()`
* main.cc:
- dispatch "perf-sstable" subcommand to
`perf::scylla_sstable_main`
before this change, we have a tool at `test/perf/perf_sstable`
for running performance tests by exercising sstable related operations.
after this change, `test/perf/perf_sstable` is integrated
into `scylla` as a subcommand, so we can run `scylla perf-sstable
[options, ...]` to perform the same tests previously driven by the tool.
Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* configure.py:
- include `test/perf/perf_row_cache_update.cc` in scylla_perfs
* main.cc:
- dispatch "perf-row-cache-update" subcommand to
`perf::scylla_row_cache_update_main`
* test/perf/perf_row_cache_update.cc: change `main()` to
`perf::scylla_row_cache_update_main()`
* test/perf/entry_point.hh: add
`perf::scylla_row_cache_update_main()`
before this change, we have a tool at `test/perf/perf_row_cache_update`
for running performance tests by updating the row cache.
after this change, `test/perf/perf_row_cache_update` is integrated
into `scylla` as a subcommand, so we can run `scylla perf-row-cache-update
[options, ...]` to perform the same tests previously driven by the tool.
Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that these functions are only used by the same compilation unit,
they don't need external linkage. so let's hide them using `static`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* configure.py:
- include `test/perf/perf_fast_forward.cc` in scylla_perfs
* main.cc:
- dispatch "perf-fast-forward" subcommand to
`perf::scylla_fast_forward_main`
* test/perf/perf_fast_forward.cc: change `main()` to
`perf::scylla_fast_forward_main()`
* test/perf/entry_point.hh: add
`perf::scylla_fast_forward_main()`
before this change, we have a tool at `test/perf/perf_fast_forward`
for running performance tests by fast-forwarding the reader.
after this change, `test/perf/perf_fast_forward` is integrated
into `scylla` as a subcommand, so we can run `scylla perf-fast-forward
[options, ...]` to perform the same tests previously driven by the tool.
Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* configure.py:
- include scylla_perfs in scylla
- move 'test/lib/debug.cc' down to scylla_perfs, as the latter uses
`debug::the_database`
- link `scylla` against seastar_testing_libs also. because we
use the helpers in `test/lib/random_utils.hh` for generating
random numbers / sequences in `perf_simple_query.cc`, and
`random_utils.hh` references `seastar::testing::local_random_engine`
as a local RNG. but `seastar::testing::local_random_engine`
is included in `libseastar_testing.a` or
`libseastar_perf_testing.a`. since we already have the rules for
linking against `libseastar_testing.a`, let's just reuse them,
and link `scylla` against this new dependency.
* main.cc:
- dispatch "perf-simple-query" subcommand to
`perf::scylla_simple_query_main`
* test/perf/perf_simple_query.cc: change `main()` to
`perf::scylla_simple_query_main()`
* test/perf/entry_point.hh: define the main function entries
so `main.cc` can find them. it's quite like how we collect
the entries in `tools/entry_point.hh`
before this change, we have a tool at `test/perf/perf_simple_query`
for running performance tests by sending simple queries to a single-node
cluster.
after this change, `test/perf/perf_simple_query` is integrated
into `scylla` as a subcommand, so we can run `scylla perf-simple-query
[options, ...]` to perform the same tests previously driven by the tool.
Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we want to integrate some perf tests into the scylla executable, so we
can run them on a regular basis. but `test/lib/cql_test_env.cc`
shares `debug::the_database` with `main.cc`, so we cannot just
compile them into a single binary without changing them.
before this change, both `test/lib/cql_test_env.cc`
and `main.cc` define `debug::the_database`.
after this change, `debug::the_database` is extracted into
`debug.cc`, so it compiles into a separate compilation unit,
and both scylla and the tests using the seastar testing framework are
linked against `debug.cc` via `scylla_core`. this paves the road to
integrating scylla with the tests linking against
`test/lib/cql_test_env.cc`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Commit 0b418fa improved the error detection of unset values in
inappropriate CQL statements, and some of the unit tests translated
from Cassandra started to pass, so this patch removes their "xfail"
mark.
In a couple of places Scylla's error message is worded differently
from Cassandra, so the test was modified to look for a shorter
string common to both implementations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12553
The reader concurrency semaphore has no mechanism to limit the memory consumption of already admitted reads. Once the collective memory consumption of all the admitted reads is above the limit, all it can do is not admit any more. Sometimes this is not enough and the memory consumption of the already admitted reads balloons to the point of OOMing the node. This pull request offers a solution: it introduces two more layers of defense above this, a soft and a hard limit. Both are multipliers applied to the semaphore's normal memory limit.
When the soft limit threshold is surpassed, all readers but one are blocked via a new blocking `request_memory()` call, which is used by the `tracking_file_impl`. The reader allowed to proceed is chosen at random: it is the first reader which happens to request memory after the limit is surpassed. This is both very simple and should avoid situations where the algorithm choosing the reader allowed to proceed picks one which will then always time out.
When the hard limit threshold is surpassed, `reader_concurrency_semaphore::consume()` starts throwing `std::bad_alloc`. This again will result in eliminating whichever reader was unlucky enough to request memory at the wrong moment.
With this, the semaphore now effectively enforces an upper bound on memory consumption, defined by the hard limit.
Refs: https://github.com/scylladb/scylladb/issues/11927
Closes #11955
* github.com:scylladb/scylladb:
test: reader_concurrency_semaphore_test: add tests for semaphore memory limits
reader_permit: expose operator<<(reader_permit::state)
reader_permit: add id() accessor
reader_concurrency_semaphore: add foreach_permit()
reader_concurrency_semaphore: document the new memory limits
reader_concurrency_semaphore: add OOM killer
reader_concurrency_semaphore: make consume() and signal() private
test: stop using reader_concurrency_semaphore::{consume,signal}() directly
reader_concurrency_semaphore: move consume() out-of-line
reader_permit: consume(): make it exception-safe
reader_permit: resource_units::reset(): only call consume() if needed
reader_concurrency_semaphore: tracked_file_impl: use request_memory()
reader_concurrency_semaphore: add request_memory()
reader_concurrency_semaphore: wrap wait list
reader_concurrency_semaphore: add {serialize,kill}_limit_multiplier parameters
test/boost/reader_concurrency_semaphore_test: dummy_file_impl: don't use hardcoded buffer size
reader_permit: add make_new_tracked_temporary_buffer()
reader_permit: add get_state() accessor
reader_permit: resource_units: add constructor for already consumed res
reader_permit: resource_units: remove noexcept qualifier from constructor
db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier
scylla-gdb.py: scylla-memory: extract semaphore stats formatting code
scylla-gdb.py: fix spelling of "graphviz"
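A toy model of the two new layers described above (multiplier values are hypothetical; the real ones come from the new `reader_concurrency_semaphore_{serialize,kill}_limit_multiplier` config options):

```python
class MemoryLimiter:
    """Sketch: the normal limit gates admission; the soft (serialize)
    limit blocks further memory requests; the hard (kill) limit makes
    them fail, mimicking consume() throwing std::bad_alloc."""
    def __init__(self, limit, serialize_mult=2, kill_mult=4):
        self.soft_limit = limit * serialize_mult
        self.hard_limit = limit * kill_mult
        self.consumed = 0

    def request_memory(self, amount):
        if self.consumed + amount > self.hard_limit:
            raise MemoryError("hard limit surpassed: read is killed")
        if self.consumed + amount > self.soft_limit:
            return "blocked"    # queued until consumption drops
        self.consumed += amount
        return "granted"
```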
`prepare_expression` takes an unprepared CQL expression straight from the parser output and prepares it. Preparation consists of various type checks that are needed to ensure that the expression is correct and to reason about it.
While `prepare_expression` supports a number of different types of expressions, until now it was impossible to prepare a `binary_operator`. Eventually we would like to be able to prepare all kinds of expressions, so this PR adds the missing support for `binary_operator`.
Closes #12550
* github.com:scylladb/scylladb:
expr_test: test preparing binary_operator with NULL RHS
expr_test: test preparing IS NOT NULL binary_operator
expr_test: test preparing binary_operator with LIKE
expr_test: test preparing binary_operator with CONTAINS KEY
expr_test: test preparing binary_operator with CONTAINS
expr_test: test preparing binary_operator with IN
expr_test: test preparing binary_operator with =, !=, <, <=, >, >=
expr_test: use make_*_untyped function in existing tests
expr_test_utils: add utilities to create untyped_constant
expr_test_utils: add make_float_* and make_double_*
cql3: expr: make it possible to prepare binary_operator using prepare_expression
cql3/expr: check that RHS of IS NOT NULL is a null value when preparing binary operators
cql3: expr: pass non-empty keyspace name in prepare_binary_operator
cql3: expr: take reference to schema in prepare_binary_operator
instead of introducing yet another variable for tracking the
status, update the args right away, for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
refactor main() to extract lookup_main_func() out, so we find
the main_func in a table instead of using a lengthy if-then-else
clause.
as the list of dispatch candidates grows, the
code would become less structured. so in this change, the code looking
up the main_func is extracted into a dedicated function for
better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 8889cbc198...d41af8b592 (14):
> Merge 'Perf stall detector related improvements' from Travis Downs
Ref #8828, #7882, #11582 (may help make progress)
> build: pass HEAPPROF definition to src/core/reactor.cc too
> Limit memory address space per core to 64GB when hwloc is not available
> build: revert use pkg_search_module(.. IMPORTED_TARGET ..) changes
> Fix missing newlines in seastar-addr2line
> Use an integral type for uniform_int_distribution
> Merge 'tls_test: use a dedicated https server for testing' from Kefu Chai
> build: use ${CMAKE_BINARY_DIR} when running 'cmake --build ..'
> build: do not set c-ares_FOUND with PARENT_SCOPE
> reactor: drop unused member function declaration
> sstring: refactor to_sstring() using fmt::format_to()
> http: delay input stream close until responses sent
> build: enable non-library targets using default option value
> Merge 'sstring: specialize uninitialize_string() and use resize_and_overwrite if available' from Kefu Chai
Closes #12509
Add unit test which checks that preparing binary_operators
which represent IS NOT NULL works as expected
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add unit test which checks that preparing binary_operators
with the LIKE operation works as expected.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add unit test which checks that preparing binary_operators
with the CONTAINS KEY operation works as expected.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add unit test which checks that preparing binary_operators
with the CONTAINS operation works as expected.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add unit test which checks that preparing binary_operators
with basic comparison operations works as expected.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Use the newly introduced convenience methods that create
untyped_constant in existing tests.
This will make the code more readable by removing
visual clutter that came with the previous overly
verbose code.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
expression tests often need to create instances of untyped_constant.
Creating them by hand is tedious because the required code is overly verbose.
Having convenience functions for it speeds up test writing.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
prepare_expression didn't allow preparing binary_operators,
so it's now implemented.
If prepare_binary_operator is unable to infer
the types it will fail with an exception instead
of returning std::nullopt, but we can live with
that for now.
Preparing binary_operators inside the WHERE
clause is currently more complicated than just
calling prepare_binary_operator. Preparation
of the WHERE clause is done inside statement_restrictions
constructor. It's done by iterating over all binary_operators,
validating them and then preparing. The validation contains
additional checks with custom error messages.
Preparation has to be done after validation,
because otherwise the error messages will change
and some tests will start failing.
Because of that we can't just call prepare_expression
on the WHERE clause yet.
It's still useful to have the ability to prepare
binary_operators using prepare_expression.
In cases where we know that the WHERE clause is valid,
we can just call prepare_expression and be done with it.
Once the grammar is fully relaxed, the artificial constraints
checked by the validation code will be removed and
it will be possible to prepare the whole WHERE clause
using just prepare_expression.
prepare_expression does a bit more than
prepare_binary_operator. In the case where
both sides of the binary_operator are known,
it will evaluate the whole binary_operator
to a constant value.
The query analysis code is NOT ready
to encounter constant boolean values inside
the WHERE clause, so for the WHERE clause we still use
prepare_binary_operator, which doesn't
evaluate the binary_operator to a
constant value.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
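The constant-folding difference described above can be sketched with simplified types (illustrative only, not the actual prepare_expression code): when both operands are already known constants, the operator folds to a constant value.

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch: a binary operator whose operands are both known
// integers can be folded into a boolean constant at preparation time.
// This mirrors what prepare_expression does, while prepare_binary_operator
// would leave the operator unevaluated.
enum class oper { eq, lt };

struct folded {
    bool value;  // the constant result of the folded operator
};

// Fold "lhs OP rhs" when both operands are known.
std::optional<folded> try_fold(int lhs, oper op, int rhs) {
    switch (op) {
    case oper::eq: return folded{lhs == rhs};
    case oper::lt: return folded{lhs < rhs};
    }
    return std::nullopt;
}
```

For example, an expression like `1 = 1` would fold to a true constant, which is exactly what the WHERE-clause analysis code is not yet prepared to see.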
When preparing a binary operator we first prepare the LHS,
which gives us information about its type and allows
us to infer the desired type of the RHS.
Then the RHS is prepared with the expectation that it
is compatible with the inferred type.
This is enough for all types of operations apart
from IS NOT NULL.
For IS NOT we should also check that the RHS value
is actually NULL; it's not enough to check that
the RHS is of the right type.
Before this change, preparing `int_col IS NOT 123`
would succeed, which is wrong.
The missing check doesn't cause any real problems;
it's impossible for the user to produce such input
because the parser will reject it.
Still, it's better to have the check because
in the future the grammar might get more relaxed
and the parser could become more generic,
making it possible to write such things.
It would be better to introduce unary_operators,
but that's a bigger change.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
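The extra check can be sketched as follows, with hypothetical stand-in types (not the real ScyllaDB signatures):

```cpp
#include <cassert>
#include <string>

// Illustrative model of a prepared right-hand side: for IS NOT it is not
// enough that the RHS has a compatible type, it must literally be NULL,
// so `int_col IS NOT 123` must be rejected.
struct prepared_rhs {
    bool is_null;       // true when the RHS is the NULL literal
    std::string text;   // original source text, useful for error messages
};

bool is_not_rhs_valid(const prepared_rhs& rhs) {
    // Type compatibility is assumed to be checked earlier; here we only
    // verify that the value is actually NULL.
    return rhs.is_null;
}
```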
For some reason we passed an empty keyspace name
to prepare_expression when preparing the LHS
of a binary operator.
This doesn't look correct: we have the keyspace
name available from the schema_ptr, so let's use that.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
prepare_binary_operator takes a schema_ptr,
but it would be useful for it to take a reference to the schema instead.
Every schema_ptr can easily be converted to a reference,
so there is no loss of functionality.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Several cases were fixed in these patches, all related to processing of malformed base64 data. The main purpose was to bring the Alternator implementation closer to what DynamoDB does. We now:
- Throw an error when padding is missing during base64 decoding
- Throw an error when base64 data is malformed
- In Alternator, when invalid base64 data is fetched from the DB (as opposed to being part of the user's request), exclude such a row during filtering
Additionally, some small code quality improvements:
- avoid unnecessary type conversions in calls to rjson::from_string functions
- avoid some copy constructions in calls to rjson::from_string functions
Fixes https://github.com/scylladb/scylladb/issues/6487
Closes #11944
* github.com:scylladb/scylladb:
alternator: evaluate expressions as false for stored malformed binary data
rjson: avoid copy constructors in from_string calls when possible
alternator: remove unused parameters from describe_items func
utils: throw error on malformed input in base64 decode
utils: throw error on missing padding in base64 decode
Materialized views inject synthetic empty-type columns in some conditions.
Since we just touched empty-type serialization/deserialization, add a
test to exercise it and make sure it still works.
Tests are similarly relaxed. A test is added in lwt_test to show
that insertion of a list with NULL is still rejected, though we
allow NULLs in IF conditions.
One test is changed from a list of longs to a list of ints, to
prevent churn in the test helper library.
Allow transient lists that contain NULL throughout the
evaluation machinery. This makes it possible to evaluate things
like `IF col IN (1, 2, NULL)` without hacks, once LWT conditions
are converted to expressions.
A few tests are relaxed to accommodate the new behavior:
- cql_query_test's test_null_and_unset_in_collections is relaxed
to allow `WHERE col IN ?`, with the variable bound to a list
containing NULL; now it's explicitly allowed
- expr_test's evaluate_bind_variable_validates_no_null_in_list was
checking generic lists for NULLs, and was similarly relaxed (and
renamed)
- expr_test's evaluate_bind_variable_validates_null_in_lists_recursively
was similarly relaxed to allow NULLs.
When we start allowing NULL in lists in some contexts, the exact
location where an error is raised (when it's disallowed) will
change. To prepare for that, relax the exception check to just
ensure the word NULL is there, without caring about the exact
wording.
Lists allow NULL in some contexts (bind variables for LWT "IN ?"
conditions), but not in most others. Currently, the implementation
just disallows NULLs in list values, and the cases where it is allowed
are hacked around. To reduce the special cases, we'll allow lists
to have NULLs, and just restrict them for storage. This is similar
to how scalar values can be NULL, but not when they are part of a
partition key.
To prepare for the transition, identify the locations where lists
(and sets, which share the same storage) are stored as frozen
values and add a NULL check there. Non-frozen lists already have the
check. Since sets share the same format as lists, apply the same to
them.
No actual checks are done yet, since NULLs are impossible. This
is just a stub.
The empty type is used internally to implement CQL sets on top
of multi-cell maps. The map's key (an atomic cell) represents the
set value, and the map's value is discarded. Since it's unneeded,
we use an internal "empty" type.
Currently, it is deserialized into a `data_value` object representing
a NULL. Since it's discarded, it really doesn't matter.
However, with the impending change to allow NULLs in lists,
it does matter:
1. the coordinator sets the 'collections_as_maps' flag for LWT
requests since it wants list indexes (this affects sets too).
2. the replica responds by serializing a set as a map.
3. since we start allowing NULL collection values, we now serialize
those NULLs as NULLs.
4. the coordinator deserializes the map, and complains about NULL
values, since those are not supported.
The solution is simple: deserialize the empty value as a non-NULL
object. We create an empty empty_type_representation and add the
scaffolding needed. Serialization and deserialization are already
coded; they were just never called for NULL values (which, luckily, were
serialized in collections with size 0 rather than size -1).
A unit test is added.
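The length-prefix detail mentioned above can be sketched like this (an illustrative model of the collection wire format, not the actual ScyllaDB serializer):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative model: inside a serialized collection each element carries
// a 32-bit length prefix, where -1 denotes NULL and 0 denotes a present
// but empty value. Deserializing the empty type into a non-NULL empty
// value therefore round-trips as length 0, not -1.
using bytes_opt = std::optional<std::vector<uint8_t>>;

std::vector<int32_t> element_length_prefixes(const std::vector<bytes_opt>& elems) {
    std::vector<int32_t> lengths;
    for (const auto& e : elems) {
        lengths.push_back(e ? static_cast<int32_t>(e->size()) : -1);
    }
    return lengths;
}
```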
When the collective memory consumption of all readers goes above
$kill_limit_multiplier * $memory_limit, consume() will throw
std::bad_alloc(), instantly unwinding the read that is unlucky enough
to have requested the last bytes of memory. This should help in situations
where there are some problematic partitions, either because of large
cells or because they are scattered across too many sstables. Currently
nothing prevents such reads from bringing down the entire node via OOM.
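A minimal sketch of this kill-limit policy, with hypothetical names (`reader_memory_tracker` is not the actual ScyllaDB class):

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Illustrative model: once total consumption across all readers would
// exceed kill_limit_multiplier * memory_limit, the next consume() throws
// std::bad_alloc and unwinds the requesting read.
struct reader_memory_tracker {
    size_t memory_limit;
    double kill_limit_multiplier;
    size_t consumed = 0;

    void consume(size_t bytes) {
        if (consumed + bytes > size_t(kill_limit_multiplier * memory_limit)) {
            throw std::bad_alloc();
        }
        consumed += bytes;
    }
};

// Demonstrates the policy: a request below the cap succeeds, one that
// would exceed it throws and leaves the accounting untouched.
bool demo_kill_limit() {
    reader_memory_tracker t{1000, 4.0};  // hard cap at 4000 bytes
    t.consume(3500);                     // fine, below the cap
    try {
        t.consume(1000);                 // would exceed 4000: must throw
        return false;
    } catch (const std::bad_alloc&) {
        return t.consumed == 3500;       // the failed request consumed nothing
    }
}
```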
Using this API is quite dangerous, as any mistake can lead to leaking
resources from the semaphore. Also, soon we will tie this API closer to
permits, so it won't be as generic. Make the methods private so we don't
have to worry about correct usage. All external users have already been
patched away.
These methods will soon be retired (made private) so migrate away from
them. Consume memory through a permit instead. It is also safer this
way: all memory consumed through the permit is guaranteed to be released
when the permit is destroyed at the latest.
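Why consuming through a permit is safer can be sketched with a toy RAII model (illustrative types, not the actual reader_permit):

```cpp
#include <cassert>
#include <cstddef>

// Toy semaphore that only tracks total consumed memory.
struct semaphore {
    size_t consumed = 0;
};

// Toy permit: all memory consumed through it is guaranteed to be
// released in the destructor at the latest, so no code path can leak it.
class permit {
    semaphore& _sem;
    size_t _held = 0;
public:
    explicit permit(semaphore& sem) : _sem(sem) {}
    permit(const permit&) = delete;
    void consume(size_t bytes) {
        _sem.consumed += bytes;
        _held += bytes;
    }
    ~permit() { _sem.consumed -= _held; }  // guaranteed release
};

// Demonstrates that destroying the permit returns everything it consumed.
bool demo_permit_releases() {
    semaphore s;
    {
        permit p(s);
        p.consume(100);
        if (s.consumed != 100) return false;
    }
    return s.consumed == 0;
}
```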
reset() is called from the destructor, with null resources. Calling
consume() can be avoided in this case, and in fact avoiding it is
required, as consume() is soon going to throw in some cases.
Use the recently added `request_memory()` to acquire the memory units for
the I/O. This allows blocking all but one reader when memory
consumption grows too high.
A possibly blocking request for more memory. If the collective memory
consumption of all reads goes above
$serialize_limit_multiplier * $memory_limit, this request will block for
all but one reader (the first requester). Until this situation is
resolved, that is, while memory stays above the limit explained above,
only this one reader is allowed to make progress. This should help rein
in the memory consumption of reads in situations where it used to
balloon without constraint.
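The serialize-limit policy can be sketched as follows (a toy synchronous model with hypothetical names; the real implementation blocks asynchronously on a semaphore rather than returning a flag):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model: above the threshold only one "blessed" reader (the
// first requester) may keep acquiring memory; every other request must wait.
struct memory_arbiter {
    size_t memory_limit;
    double serialize_limit_multiplier;
    size_t consumed = 0;
    const void* blessed_reader = nullptr;  // the one reader allowed to proceed

    // Returns true if the request can proceed now, false if it must wait.
    bool request_memory(const void* reader, size_t bytes) {
        size_t threshold = size_t(serialize_limit_multiplier * memory_limit);
        if (consumed > threshold) {
            if (!blessed_reader) {
                blessed_reader = reader;   // first requester above the limit
            } else if (blessed_reader != reader) {
                return false;              // everyone else blocks
            }
        }
        consumed += bytes;
        return true;
    }
};

// Demonstrates the policy: once above the threshold, the first requester
// keeps making progress while a second reader is told to wait.
bool demo_serialize_limit() {
    memory_arbiter a{1000, 2.0};  // threshold at 2000 bytes
    int r1 = 0, r2 = 0;
    if (!a.request_memory(&r1, 2500)) return false;  // below threshold, proceeds
    if (!a.request_memory(&r1, 100)) return false;   // r1 becomes the blessed reader
    if (a.request_memory(&r2, 100)) return false;    // everyone else must wait
    return true;
}
```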
We'll try to distinguish the case when data comes from storage rather
than from a user request. Such an attribute can be used in expressions,
and when it can't be decoded it should make the expression evaluate as
false, to simply exclude the row during a filtering query or scan.
Note that this change focuses on the binary type; for other types we
may have some inconsistencies in the implementation.
We already fixed the case of missing padding, but there is also a
more generic one, where the input to the decode function contains
non-base64 characters.
This is mostly done for Alternator's purposes: it should discard
a request containing such data and return a 400 HTTP error.
Additionally, some harmless integer overflow during integer casting
was fixed here. A fix was attempted in 2d33a3f,
but since we also implicitly cast to uint8_t, the problem persisted.
This is done to make Alternator behavior more on a par with DynamoDB.
The decode function is used there when processing user requests containing
binary item values. We will now discard improperly formed user input with
a 400 HTTP error.
It also makes things more consistent, as some of our other base64 functions
may have assumed padding is present.
The patch should not break other usages of the base64 functions, as the
only other one is in db/hints, where the code already throws std::runtime_error.
Fixes #6487
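A minimal sketch of the stricter validation described in these commits (illustrative only; the real decoder differs and this check is not exhaustive):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Reject base64 input whose length is not a multiple of four (missing
// padding) or which contains characters outside the base64 alphabet.
void validate_base64(std::string_view s) {
    if (s.size() % 4 != 0) {
        throw std::invalid_argument("base64 input is missing padding");
    }
    auto is_b64 = [](char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9') || c == '+' || c == '/';
    };
    for (size_t i = 0; i < s.size(); ++i) {
        char c = s[i];
        if (c == '=') {
            // '=' may only appear as the last one or two characters.
            if (i + 2 < s.size()) {
                throw std::invalid_argument("malformed base64 padding");
            }
        } else if (!is_b64(c)) {
            throw std::invalid_argument("malformed base64 character");
        }
    }
}

// Convenience wrapper used below: true when the input validates.
bool base64_ok(std::string_view s) {
    try {
        validate_base64(s);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

In Alternator terms, a thrown error here would translate into rejecting the request with a 400 HTTP error, while data already stored in the DB is instead excluded during filtering.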
In `dma_read_bulk()`, use the `range_size` passed as parameter and have
the callers pass meaningful sizes. We got away with callers passing 0
and using a hard-coded size internally because the tracking file wrapper
used the size of the returned buffer as the basis for memory tracking.
This will soon not be the case and instead the passed-in size will be
used, so this has to be fixed.
A separate method for callers of make_tracked_temporary_buffer() who
are creating new empty tracked buffers of a certain size.
make_tracked_temporary_buffer() is about to be changed to be more
targeted at callers who call it with pre-consumed memory units.
Use the [ScyllaDB Community Forum](https://forum.scylladb.com) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
The `scylla.yaml` file in the repository by default writes all database data to `/var/lib/scylla`, which likely requires root access. Change the `data_file_directories`, `commitlog_directory` and `schema_commitlog_directory` fields as appropriate.
Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.