This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to back up
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closes scylladb/scylladb#16154
(cherry picked from commit 7ad0b92559)
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.
The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.
The easy fix is to perform the only potentially-throwing step first.
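The ordering idea can be sketched in isolation. The types below (`throwing_key`, `position`) are hypothetical stand-ins, not the real `position_in_partition` code; they only illustrate why assigning the potentially-throwing member first keeps the object consistent when the assignment fails.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a member whose copy assignment may throw
// (for example, because it allocates).
struct throwing_key {
    int value = 0;
    bool fail_on_copy = false;

    throwing_key(int v, bool fail) : value(v), fail_on_copy(fail) {}
    throwing_key(const throwing_key&) = default;

    throwing_key& operator=(const throwing_key& other) {
        if (other.fail_on_copy) {
            throw std::runtime_error("simulated allocation failure");
        }
        value = other.value;
        fail_on_copy = other.fail_on_copy;
        return *this;
    }
};

// Simplified model: two fields that cannot throw on assignment and
// one field that can.
struct position {
    int type;
    int bound_weight;
    throwing_key ck;

    position(int t, int w, throwing_key k)
        : type(t), bound_weight(w), ck(std::move(k)) {}
    position(const position&) = default;

    position& operator=(const position& other) {
        ck = other.ck;                     // may throw: nothing has been modified yet
        type = other.type;                 // cannot throw
        bound_weight = other.bound_weight; // cannot throw
        return *this;
    }
};

// Returns true if a failed assignment leaves the target unchanged.
inline bool failed_assignment_keeps_target_intact() {
    position target(1, 1, throwing_key(10, false));
    const position source(2, -1, throwing_key(20, true));
    try {
        target = source;
    } catch (const std::runtime_error&) {
        return target.type == 1 && target.bound_weight == 1 && target.ck.value == 10;
    }
    return false; // the simulated failure did not happen
}
```

If the two plain fields were assigned before `ck`, the same failure would leave `target` with the new `type` and `bound_weight` but the old key.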
Fixes #15822
Closes scylladb/scylladb#15864
(cherry picked from commit 93ea3d41d8)
* tools/jmx 06f2735...ed3cc6d (1):
> Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski
* tools/java be0aaf7597...7459a11815 (1):
> Merge 'build: update several dependencies' from Piotr Grabowski
Update build dependencies which were flagged by security scanners.
Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351
Closes #16151
Currently, the API call recalculates only per-node schema version. To
work around issues like #4485, we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has an impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema
Fixes #15380
Closes #15381
(cherry picked from commit c27d212f4b)
(cherry picked from commit bfd8401477)
Currently, when said feature is enabled, we recalculate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalculation. However, this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidentally: by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and, as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.
Fixes: #16004
Closes scylladb/scylladb#16013
(cherry picked from commit 22381441b0)
(cherry picked from commit e31f2224f5)
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on a shard other than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
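A minimal sketch of the pitfall, using hypothetical function names rather than the real `merge_schema` signature: with a defaulted flag, the forwarding call compiles even when the flag is dropped; without the default, every call site has to pass it explicitly.

```cpp
#include <cassert>

// Hypothetical result type recording where the work ran and with
// which flag value.
struct merge_result {
    unsigned executed_on;
    bool reload;
};

// Buggy shape: the default argument lets the recursive call compile
// even though it silently drops the caller's `reload` value.
inline merge_result merge_with_default(unsigned shard, bool reload = false) {
    if (shard != 0) {
        return merge_with_default(0); // bug: `reload` is not forwarded
    }
    return {shard, reload};
}

// Safer shape: no default, so forgetting the flag is a compile error.
inline merge_result merge_explicit(unsigned shard, bool reload) {
    if (shard != 0) {
        return merge_explicit(0, reload); // flag forwarded to shard 0
    }
    return {shard, reload};
}
```

Calling `merge_with_default(3, true)` ends up running on shard 0 with `reload == false`, while `merge_explicit(3, true)` preserves the flag.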
(cherry picked from commit 48164e1d09)
(cherry picked from commit c994ed2057)
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
the schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes #4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes #14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
(cherry picked from commit cf81eef370)
(cherry picked from commit 40eed1f1c5)
Currently the code will assert because the cl pointer will be null, and it
will be null because there are no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>
(cherry picked from commit 941407b905)
Backport needed by #4485.
(cherry picked from commit f233c8a9e4)
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.
To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.
Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.
I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.
Fixes: #15485
Closes scylladb/scylladb#16019
(cherry picked from commit dfd7981fa7)
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".
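The check itself lives in a shell script; the C++ below is only a sketch of the same logic, with hypothetical function names. It assumes the `ID` and `ID_LIKE` values have already been read from os-release with quotes stripped, and that `ID_LIKE` is a space-separated list.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if the space-separated ID_LIKE list contains `wanted`.
inline bool id_like_contains(const std::string& id_like, const std::string& wanted) {
    std::istringstream words(id_like);
    std::string word;
    while (words >> word) {
        if (word == wanted) {
            return true;
        }
    }
    return false;
}

// RHEL itself is identified by ID; compatible distributions are
// identified by ID_LIKE, so both have to be checked.
inline bool is_rhel_family(const std::string& id, const std::string& id_like) {
    return id == "rhel" || id_like_contains(id_like, "rhel");
}
```

For example, with the assumed values `ID=rhel` and `ID_LIKE=fedora`, checking `ID_LIKE` alone misses the system, while `is_rhel_family("rhel", "fedora")` matches it.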
Fixes #16040
Closes scylladb/scylladb#16041
(cherry picked from commit 338a9492c9)
When a base write triggers an mv write that needs to be sent to another
shard, it used the same service group, and we could end up with a
deadlock.
This fix also affects alternator's secondary indexes.
Testing was done using a not yet committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch, when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests), after a couple of seconds
scylla hangs, cpu usage drops to zero, and no progress is made. We can confirm we're hitting this issue by checking under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting a different limit, as there wasn't any depleted
smp service group semaphore and it was also happening on non-mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844
Closes scylladb/scylladb#15845
(cherry picked from commit 020a9c931b)
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be backported
to older versions easily and without risk; this is that fix.
Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.
Fixes: scylladb/scylladb#15816
Closes scylladb/scylladb#15970
* github.com:scylladb/scylladb:
test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
api: failure_detector: fix indentation
api: failure_detector: invoke on shard 0
(cherry picked from commit 9443253f3d)
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Backport notes: relatively easy to backport, had to include
**replica: Make compaction_group responsible for deleting off-strategy compaction input**
and
**compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0**
Closes #15794
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
replica: Make compaction_group responsible for deleting off-strategy compaction input
removenode host_id must specify the host ID as a UUID,
not an IP address.
Fixes #11839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11840
(cherry picked from commit 44e1058f63)
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` wants to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystems
implement the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data loss if the application / machine /
network breaks when the data is considered persisted but is
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped class.
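The effect of the missing delegation can be modelled with a small synchronous sketch. The classes below are hypothetical simplifications and not the Seastar `data_sink_impl` interface (which is future-based); the `upload_sink` behaviour of aborting on `close()` without a prior `flush()` is taken from the description above.

```cpp
#include <cassert>
#include <string>

// Simplified synchronous sink interface (hypothetical).
struct sink {
    virtual ~sink() = default;
    virtual void put(const std::string& data) = 0;
    virtual void flush() {} // default implementation does nothing
    virtual void close() = 0;
};

// Models an upload that is finalized on close() only if it was
// flushed first; otherwise the upload is aborted.
struct upload_sink : sink {
    bool flushed = false;
    bool committed = false;
    bool aborted = false;

    void put(const std::string&) override { flushed = false; }
    void flush() override { flushed = true; }
    void close() override {
        if (flushed) {
            committed = true;
        } else {
            aborted = true;
        }
    }
};

// Wrapper that computes a toy checksum and forwards everything,
// including flush(), to the wrapped sink.
struct checksumming_sink : sink {
    sink& out;
    unsigned checksum = 0;

    explicit checksumming_sink(sink& o) : out(o) {}

    void put(const std::string& data) override {
        for (unsigned char c : data) {
            checksum += c;
        }
        out.put(data);
    }
    void flush() override { out.flush(); } // without this override, close() aborts the upload
    void close() override { out.close(); }
};

// Returns true if an upload written through the wrapper is committed.
inline bool upload_commits_through_wrapper() {
    upload_sink upload;
    checksumming_sink wrapped(upload);
    wrapped.put("abc");
    wrapped.flush();
    wrapped.close();
    return upload.committed && !upload.aborted;
}
```

Removing the `flush()` override from `checksumming_sink` makes the inherited no-op run instead, so `upload_sink::close()` takes the abort branch.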
Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15134
(cherry picked from commit d2d1141188)
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).
Easily fixed by removing the empty alternative. A unit test is
added.
Fixes #14705.
Closes #14707
(cherry picked from commit e00811caac)
In this branch (5.1), the most recent available rustc version is 1.60.
Despite that, the 'cargo install' command tries to install the most
recent version of a package by default, which may rely on newer rustc
versions. This patch specifies the version of the cxxbridge-cmd package
to one that works with rustc 1.60.
Closes scylladb/scylladb#15812
[avi: regenerated frozen toolchain]
Closes scylladb/scylladb#15828
Prevent div-by-zero by returning const level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to cover also offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b1e164a241)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It's not a problem until now
where we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes #14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 42050f13a0)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting regular_compaction, which implements
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit db9ce9f35a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation to next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.
Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
sstables::offstrategy offstrategy)
So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.
Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13432
(cherry picked from commit 457c772c9c)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit 8c4b5e4 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfies the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tends to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But in the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
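The difference between the two orders can be sketched by counting how often the expensive step runs. Everything below is a hypothetical model: the max purgeable lookup is reduced to a counter plus a fixed timestamp, and the struct and function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified tombstone: write timestamp and local deletion time.
struct tombstone {
    long timestamp;
    long deletion_time;
};

struct gc_stats {
    std::size_t max_purgeable_calls = 0; // models the expensive bloom filter checks
    std::size_t purged = 0;
};

// Deferred order: the cheap grace period check runs first, so the
// expensive lookup only happens for tombstones that may be GC'able.
inline gc_stats purge_deferred(const std::vector<tombstone>& ts, long gc_before,
                               long max_purgeable_ts) {
    gc_stats stats;
    for (const auto& t : ts) {
        if (t.deletion_time >= gc_before) {
            continue; // still within the grace period: skip the lookup
        }
        ++stats.max_purgeable_calls;
        if (t.timestamp < max_purgeable_ts) {
            ++stats.purged;
        }
    }
    return stats;
}

// Eager order: the expensive lookup runs for every tombstone.
inline gc_stats purge_eager(const std::vector<tombstone>& ts, long gc_before,
                            long max_purgeable_ts) {
    gc_stats stats;
    for (const auto& t : ts) {
        ++stats.max_purgeable_calls;
        const bool purgeable = t.timestamp < max_purgeable_ts;
        if (purgeable && t.deletion_time < gc_before) {
            ++stats.purged;
        }
    }
    return stats;
}
```

Both orders purge the same tombstones; only the number of expensive lookups differs, which is what the two estimates above express.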
Cherry picked from commit 38b226f997
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes #14091.
Closes #13908
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15745
This is a backport of PR https://github.com/scylladb/scylladb/pull/15740.
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.0-to-5.1 upgrade guide and replace with general info.
- Remove the information from the 4.6-to-5.0 upgrade guide and replace with general info.
- Remove the information from the 5.x.y-to-5.x.z upgrade guide and replace with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
Closes #15769
* github.com:scylladb/scylladb:
doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
doc: remove wrong image upgrade info (4.6-to-5.0)
doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
(cherry picked from commit dd1207cabb)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 526d543b95)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 9852130c5b)
The estimated_partitions is estimated after the repair_meta is created.
Currently, the default estimated_partitions was used to create the
writer, which is not correct.
To fix, use the updated estimated_partitions.
Reported by Petr Gusev
Closes #14179
Fixes #15748
(cherry picked from commit 4592bbe182)
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, and the inactive read registration path.
1) service level drop invokes stop of the reader concurrency semaphore, which will
wait for in flight requests
2) turns out it stops first the gate used for closing readers that will
become inactive.
3) proceeds to wait for in-flight reads by closing the reader permit gate.
4) one of the evictable reads takes the inactive read registration path, and
finds the gate for closing readers closed.
5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.
Fixes #15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15535
(cherry picked from commit 914cbc11cf)
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
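The decision rule can be sketched as follows. The types are hypothetical simplifications of the real `result_message` hierarchy; `no_change` is an assumed name for whatever result an IF NOT EXISTS statement produces when the table already exists. Only the `is_schema_change()` method name comes from this series.

```cpp
#include <cassert>

// Hypothetical classification of what handling the query produced.
enum class result_kind {
    schema_change,   // the table was actually created
    bounce_to_shard, // the query must be re-executed on another shard
    no_change        // assumed: e.g. IF NOT EXISTS on an existing table
};

struct result_message {
    result_kind kind;
    bool is_schema_change() const { return kind == result_kind::schema_change; }
};

// Grant permissions to the creator only if the statement really
// created something.
inline bool should_grant_to_creator(const result_message& msg) {
    return msg.is_schema_change();
}
```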
Additionally, a test is added that reproduces both of the mentioned
cases.
CVE-2023-33972
Fixes #15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
(cherry picked from commit ab6988c52f)
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expires 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.
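As a worked example with made-up numbers: assume a compaction reads 100 MiB of input, expires 99% of it, writes 1 MiB, and takes 10 seconds. The helper below is a hypothetical illustration, not the function used in the code base.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Throughput in MiB/s for a given number of bytes processed over a
// duration in seconds.
inline double throughput_mib_per_s(std::uint64_t bytes, double seconds) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0) / seconds;
}
```

With these numbers, the output-based figure is `throughput_mib_per_s(1 MiB, 10 s)`, about 0.1 MiB/s, while the input-based figure is `throughput_mib_per_s(100 MiB, 10 s)`, 10 MiB/s, which better reflects the work the compaction had to do.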
Fixes #14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14615
(cherry picked from commit 3b1829f0d8)
We allow inserting column values using a JSON value, e.g.:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```
When no JSON value is specified, the query should be rejected.
Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.
A unit test is added to prevent regressions.
Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #14785
(cherry picked from commit cbc97b41d4)
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look for the item after it was already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes #14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14496
(cherry picked from commit 599636b307)
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
Fixes #15198
Closes #15199
(cherry picked from commit 2000a09859)
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.
Move lock to the callers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13240
(cherry picked from commit da00052ad8)
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.
Closes #12540
* github.com:scylladb/scylladb:
api: gossiper: fix alive nodes
gms, service: lock live endpoint copy
gms, service: live endpoint copy method
(cherry picked from commit b919373cce)
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.
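The capping rule can be sketched as below. The names are hypothetical, and the sketch assumes a non-negative input in seconds since the epoch and that every value above `INT32_MAX - 1` is capped; in real code the `was_capped` flag is where the more detailed log message would be emitted.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Result of capping: the stored 32-bit value and whether capping
// happened (the caller would log details in that case).
struct capped_time {
    std::int32_t value;
    bool was_capped;
};

inline capped_time cap_local_deletion_time(std::int64_t seconds_since_epoch) {
    constexpr std::int64_t max_allowed = std::numeric_limits<std::int32_t>::max() - 1;
    if (seconds_since_epoch > max_allowed) {
        return {static_cast<std::int32_t>(max_allowed), true};
    }
    return {static_cast<std::int32_t>(seconds_since_epoch), false};
}
```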
Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)
Closes #15151
This mini-series backports the fix for #12010 along with low-risk patches it depends on.
Fixes: #12010
Closes #15135
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
utils: Restore indentation after previous patch
utils: Coroutinize verify_owner_and_mode()
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes #12010
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
There's a helper verification_error() that prints a warning and returns
an exceptional future. It is converted into a void function that throws.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow the exception in perform_task for reshape compaction.
Fixes: #15058.
(cherry picked from commit e0ce711e4f)
Closes #15123
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.
Fixes #14963
Closes #14964
(cherry picked from commit e13a2b687d)
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.
When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.
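The log-and-continue behavior can be sketched as follows. This is a simplified, synchronous model: the `no_such_column_family` type below is a stand-in defined only for this example, and `repair_tables` and `repair_one` are hypothetical names.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the exception named in the commit message; the real type
// is defined in ScyllaDB and looks different.
struct no_such_column_family : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Repairs each table in turn. A table that was dropped in the meantime is
// logged and skipped; any other exception still propagates to the caller.
// Returns the names of the tables that were actually repaired.
inline std::vector<std::string> repair_tables(
        const std::vector<std::string>& tables,
        const std::function<void(const std::string&)>& repair_one) {
    std::vector<std::string> repaired;
    for (const auto& table : tables) {
        try {
            repair_one(table);
            repaired.push_back(table);
        } catch (const no_such_column_family& e) {
            std::cerr << "skipping dropped table " << table << ": " << e.what() << '\n';
        }
    }
    return repaired;
}
```

Only the one exception type is caught, so unrelated repair failures are not silently hidden.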
Fixes: scylladb/scylladb#13045
Closes scylladb/scylladb#13068
* github.com:scylladb/scylladb:
repair: fix indentation
repair: continue user requested repair if no_such_column_family is thrown
repair: add find_column_family_if_exists function
(cherry picked from commit 9859bae54f)
Will be useful for writing tests which trigger failures, and for
workarounds in production.
(cherry picked from commit 5c8ad2db3c)
Refs scylladb/scylladb#12969
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742).
As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.
Fixes #14501
Fixes #14742
Closes #14745
* github.com:scylladb/scylladb:
test/cql-pytest: test confirming that casting to counter doesn't work
cql: support casting of counter to other types
cql: implement missing counterasblob() and blobascounter() functions
cql: implement missing type functions for "counters" type
(cherry picked from commit a637ddd09c)
Small modification was needed to validate_visitor API for the patch to
apply.
This patch includes a translation of two more test files from
Cassandra's CQL unit test directory cql3/validation/operations.
All tests included here pass on Cassandra. Several tests fail on Scylla
and are marked "xfail". These failures discovered two previously-unknown
bugs:
#12243: Setting USING TTL of "null" should be allowed
#12247: Better error reporting for oversized keys during INSERT
And also added reproducers for two previously-known bugs:
#3882: Support "ALTER TABLE DROP COMPACT STORAGE"
#6447: TTL unexpected behavior when setting to 0 on a table with
default_time_to_live
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12248
(cherry picked from commit 0c26032e70)
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.
This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.
All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:
Already known issues:
Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k
New issues:
Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12816
(cherry picked from commit 328cdb2124)
(cherry picked from commit e11561ef65)
Modified for 5.1 to comment out error-path tests for "unset" values that
are silently ignored (instead of being detected) in this version.
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14528
(cherry picked from commit f08bc83cb2)
(cherry picked from commit e03c21a83b)
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:
1. A new test file, test_cast_data.py, for testing data casts (a
CAST (...) as ... in a SELECT), starting with testing casts from
varint to other types.
The test uncovers a lot of interesting cases (it is heavily
commented to explain these cases) but nothing there is wrong
and all tests pass on Scylla.
2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
that this caused #13551. In Cassandra and older Scylla, the sum
returned a NaN. In Scylla today, it generates a misleading
error message.
As usual, the tests were run on both Cassandra (4.1.1) and Scylla.
Refs #13551.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 78555ba7f1)
(cherry picked from commit 79b5befe65)
If semaphore mismatch occurs, check whether both semaphores belong
to the user. If so, log a warning, bump the `querier_cache_scheduling_group_mismatches` stat and drop the cached reader instead of throwing an error.
Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and performs it in all `lookup_*_querier` methods.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
This patch doesn't solve the problem of mismatched semaphores caused by changes in service levels/scheduling groups; it only mitigates it.
Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770
Closes #14736
* github.com:scylladb/scylladb:
querier_cache: add stats of scheduling group mismatches
querier_cache: check semaphore mismatch during querier lookup
querier_cache: add reference to `replica::database::is_user_semaphore()`
replica:database: add method to determine if semaphore is user one
(cherry picked from commit a8feb7428d)
This mini-series introduces dht::tokens_filter and uses it for consuming staging sstable in the view_update_generator.
The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges.
Refs #9559
Closes #11932
* github.com:scylladb/scylladb:
db: view_update_generator: always clean up staging sstables
compaction: extract incremental_owned_ranges_checker out to dht
(cherry picked from commit 3aff59f189)
do_refresh_state() keeps iterators to rows_entry in a vector.
This vector might be resized during the procedure, triggering
memory reclaim and invalidating the iterators, which can cause
arbitrarily long loops and/or a segmentation fault during make_heap().
To fix this, do_refresh_state has to always be called from the allocating
section.
Additionally, it turns out that the first do_refresh_state is useless,
because reset_state() doesn't set _change_mark. This causes do_refresh_state
to be needlessly repeated during a next_row() or next_range_tombstone() which
happens immediately after it. Therefore this patch moves the _change_mark
assignment from maybe_refresh_state to do_refresh_state, so that the change mark
is properly set even after the first refresh.
Fixes #14696
Closes #14697
(cherry picked from commit 41aef6dc96)
before this change, there are chances that the temporary sstables
created for collecting the GC-able data created by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.
in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have a good chance of
overlapping the latter, these GC-only SSTables are assigned
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.
so, in this change,
* to prevent sstables in the transient stage from being picked
up by regular compactions, a new interface class is introduced
so that the sstable is always added to registration before
it is added to sstable set, and removed from registration after
it is removed from sstable set. the struct helps to consolidate
the registration related logic in a single place, and helps to
make it more obvious that the timespan of an sstable in
the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
overlap with the output sstable run. the run_id for the
gc sstable run is created only when the gc sstable writer
is created, because gc sstables are not always created
for all compactions.
please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` pass a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.
Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14725
(cherry picked from commit fdf61d2f7c)
Closes #14828
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of them can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size while containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
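The accounting change can be illustrated with a small model. Everything here is hypothetical: the class `buffering_consumer`, its method names, and the flush-on-limit policy are simplifications for this example, and sizes are passed in by the caller instead of being derived from real mutation objects.

```cpp
#include <cstddef>

// Simplified, hypothetical model of a buffering consumer.
class buffering_consumer {
    std::size_t _buffer_size = 0;
    std::size_t _buffered_partitions = 0;
    std::size_t _limit;
    std::size_t _flushes = 0;

    // Flush (here: just reset the counters) once the limit is reached.
    void maybe_flush() {
        if (_buffer_size >= _limit) {
            ++_flushes;
            _buffer_size = 0;
            _buffered_partitions = 0;
        }
    }
public:
    explicit buffering_consumer(std::size_t limit) : _limit(limit) {}

    // Account for the freshly created (possibly empty) mutation object,
    // so that empty partitions also count toward the limit.
    void consume_new_partition(std::size_t mutation_object_size) {
        ++_buffered_partitions;
        _buffer_size += mutation_object_size;
        maybe_flush();
    }

    void consume_fragment(std::size_t fragment_size) {
        _buffer_size += fragment_size;
        maybe_flush();
    }

    std::size_t flushes() const { return _flushes; }
    std::size_t buffered_partitions() const { return _buffered_partitions; }
};
```

With a limit of 100 and ten empty partitions of 32 bytes each, this model flushes after the fourth and the eighth partition. If the partition-start size were not counted, it would never flush.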
Fixes: #14819
Closes #14821
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
(cherry picked from commit 056d04954c)
(cherry picked from commit e34c62c567)
It was found that cached_file dtor can hit the following assert
after OOM
cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.
cached_file's dtor iterates through all entries and evicts those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.
That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.
share() is responsible for automatically placing the page into
LRU once refcount drops to zero.
If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.
Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.
To fix this, a page read ahead is now linked into the LRU so that
it doesn't sit in memory indefinitely. This also allows the
cached_file dtor to clear all cache if some of the pages brought
in advance aren't fetched later.
A reproducer was added.
Fixes #14814.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14818
(cherry picked from commit 050ce9ef1d)
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was fed into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
The fix is to use the sstable_predicate type, so there is no
need to construct a temporary object on the stack.
Fixes #14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14813
(cherry picked from commit 0ac43ea877)
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after a reboot.
This is incorrect behavior; we should start it during the setup.
Also, unmask is unnecessary for enabling the timer.
Fixes #14249
Closes #14252
(cherry picked from commit c70a9cbffe)
Closes #14420
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes #14676
Closes #14677
(cherry picked from commit 1b577e0414)
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.
Refs #7926
Fixes #13689
Closes #12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
(cherry picked from commit d2e089777b)
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
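The proposed matching rule can be sketched as a lookup with a fallback. The `tenant` struct, the function `match_tenant`, and the cookie and group names used with it are hypothetical and exist only for this illustration. The real code works on connection types and scheduling groups in the messaging service.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified model of a tenant known to the local node.
struct tenant {
    std::string name;
    std::string scheduling_group;
};

// Returns the tenant whose name matches the isolation cookie. An unknown
// cookie is assumed to belong to a tenant this node does not know yet,
// so it is matched to the default statement tenant.
inline const tenant& match_tenant(const std::vector<tenant>& known,
                                  const tenant& default_statement_tenant,
                                  const std::string& isolation_cookie) {
    for (const auto& t : known) {
        if (t.name == isolation_cookie) {
            return t;
        }
    }
    return default_statement_tenant;
}
```

The only difference from the previous behavior, in this model, is what the fallback returns: the default statement tenant rather than the default scheduling group.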
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
Fixes: #13841
Fixes: #12552
Closes #13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
(cherry picked from commit a7c2c9f92b)
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is threefold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one of the upserts, as all cells in the respective upsert carry the same expiration time.
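The ordering described above, for two live cells, can be sketched as a comparison function. This is an illustration only: the `cell` struct and `compare_for_merge` are made up for this example. The choice that an expiring cell wins over a non-expiring one, and that a smaller ttl wins when the expiration times are equal (because it implies a later write), are assumptions of this sketch and not statements about the actual code.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Simplified stand-in for a live cell; not the ScyllaDB atomic_cell type.
struct cell {
    int64_t timestamp;
    std::optional<int64_t> expiry; // set only for expiring cells
    int64_t ttl;                   // meaningful only when expiry is set
    std::string value;
};

// Returns > 0 if `left` wins the merge, < 0 if `right` wins, 0 if equal.
inline int compare_for_merge(const cell& left, const cell& right) {
    if (left.timestamp != right.timestamp) {
        return left.timestamp < right.timestamp ? -1 : 1;
    }
    if (left.expiry.has_value() != right.expiry.has_value()) {
        // Assumption of this sketch: the expiring cell wins.
        return left.expiry.has_value() ? 1 : -1;
    }
    if (left.expiry.has_value()) {
        if (*left.expiry != *right.expiry) {
            return *left.expiry < *right.expiry ? -1 : 1; // later expiration wins
        }
        if (left.ttl != right.ttl) {
            return left.ttl < right.ttl ? 1 : -1; // smaller ttl: written later
        }
    }
    // The value is compared last.
    const int c = left.value.compare(right.value);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}
```

With this ordering, a cell that expires earlier no longer wins just because its value compares larger, which is the edge case from item 2.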
Fixes scylladb/scylladb#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
Closes scylladb/scylladb#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
(cherry picked from commit 87b4606cd6)
Closes #14651
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Backported from c25201c1a3. Some minor API-related adjustments were needed.
Closes #14621
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
Fixes #11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes #14631
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There have to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes was modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
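The recursion-to-loop transformation can be illustrated with a synchronous model. This sketch does not use Seastar: a plain `while` loop stands in for seastar::repeat_until_value, and the `merger` class with its `int` fragments is a hypothetical simplification of the real reader mergers.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, synchronous model of a merger over readers, many of which
// may be empty.
class merger {
    std::vector<std::vector<int>> _readers; // each reader holds its fragments
    std::size_t _next = 0;                  // index of the next reader to open

    // One step: returns a batch if one is available, or std::nullopt if the
    // reader just opened was empty and another step is needed.
    std::optional<std::vector<int>> maybe_produce_batch() {
        if (_next == _readers.size()) {
            return std::vector<int>{}; // end of stream: empty batch
        }
        auto& reader = _readers[_next++];
        if (reader.empty()) {
            return std::nullopt;
        }
        return reader;
    }
public:
    explicit merger(std::vector<std::vector<int>> readers)
        : _readers(std::move(readers)) {}

    // Loops instead of recursing, so the stack depth stays constant no
    // matter how many empty readers are skipped in a row.
    std::vector<int> operator()() {
        while (true) {
            if (auto batch = maybe_produce_batch()) {
                return std::move(*batch);
            }
        }
    }
};
```

In a recursive version, each empty reader would add a stack frame. Here, skipping many empty readers costs only loop iterations.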
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes #14452
(cherry picked from commit ee9bfb583c)
Closes #14604
The discussion on the thread says that when we reformat a volume with another
filesystem, the kernel and libblkid may skip populating /dev/disk/by-* because
they detect two filesystem signatures, since mkfs.xxx did not clear the previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for the RAID device is needed since wipefs on the disks doesn't clear filesystem signatures on /dev/mdX (we may see a previous filesystem signature on /dev/mdX when we construct a RAID volume multiple times on the same disks).
Also drop the -f option from mkfs.xfs, so that it checks that wipefs is working as we
expect.
Fixes #13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #13738
(cherry picked from commit fdceda20cc)
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in an improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes #14503
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes #14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
(cherry picked from commit 8a7261fd70)
Prior to off-strategy compaction, streaming / repair would place
staging files into main sstable set, and wait for view building
completion before they could be selected for regular compaction.
The reason for that is that view building relies on table providing
a mutation source without data in staging files. Had regular compaction
mixed staging data with non-staging one, table would have a hard time
providing the required mutation source.
After off-strategy compaction, staging files can be compacted
in parallel to view building. If off-strategy completes first, it
will place the output into the main sstable set. So a parallel view
building (on sstables used for off-strategy) may potentially get a
mutation source containing staging data from the off-strategy output.
That will mislead view builder as it won't be able to detect
changes to data in main directory.
To fix it, we'll do what we did before. Filter out staging files
from compaction, and trigger the operation only after we're done
with view building. We're piggybacking on off-strategy timer for
still allowing the off-strategy to only run at the end of the
node operation, to reduce the amount of compaction rounds on
the data introduced by repair / streaming.
Fixes #11882.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11919
(cherry picked from commit a57724e711)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14365
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
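One common way to avoid such reallocations is to reserve capacity up front. The sketch below shows the general technique and is not necessarily how the patch does it. The helper `names_for_logging` and the "sstable-N" naming are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collects the names of the input sstables for a log
// message. Reserving up front allocates the vector's storage once, instead
// of growing it step by step (roughly log2(n) reallocations).
inline std::vector<std::string> names_for_logging(const std::vector<int>& generations) {
    std::vector<std::string> names;
    names.reserve(generations.size());
    for (int gen : generations) {
        names.push_back("sstable-" + std::to_string(gen));
    }
    return names;
}
```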
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13907
(cherry picked from commit 5544d12f18)
Fixes scylladb/scylladb#14071
Information was duplicated before and the version on this page was outdated - RBNO is enabled for replace operation already.
Closes #12984
(cherry picked from commit bd7caefccf)
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
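The predicate-based approach can be sketched as follows. This is only an illustration: `sstable_stub`, `sstable_predicate` and `select_sstables` are hypothetical names, not the actual sstable set API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an sstable; only the fields needed here.
struct sstable_stub {
    std::string name;
    bool staging;
};

using sstable_predicate = std::function<bool(const sstable_stub&)>;

// Pick the reader's inputs by filtering the existing set with a predicate,
// so no second set has to be built and populated.
std::vector<const sstable_stub*> select_sstables(
        const std::vector<sstable_stub>& set,
        const sstable_predicate& pred) {
    std::vector<const sstable_stub*> selected;
    for (const auto& sst : set) {
        if (pred(sst)) {
            selected.push_back(&sst);
        }
    }
    return selected;
}
```

A caller that wants to skip staging sstables would pass a predicate such as `[](const sstable_stub& s) { return !s.staging; }`.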
With this improvement, view building was measured to be 3x faster.
from
INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s
to
INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s
Refs #14089.
Fixes #14244.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14476
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
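A minimal sketch of the intended loop condition, with positions modelled as plain integers. `fill_buffer` and its parameters are hypothetical and do not reflect the evictable reader's actual code.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Fill a buffer from `source`. Filling may stop only once the buffer is full
// AND its last position is strictly greater than the last position of the
// previous fill, which is the forward-progress guarantee.
std::vector<int> fill_buffer(std::deque<int>& source,
                             int last_pos_of_previous_fill,
                             std::size_t soft_limit) {
    std::vector<int> buffer;
    while (!source.empty()) {
        const bool full = buffer.size() >= soft_limit;
        const bool progressed =
            !buffer.empty() && buffer.back() > last_pos_of_previous_fill;
        if (full && progressed) {
            break;
        }
        buffer.push_back(source.front());
        source.pop_front();
    }
    return buffer;
}
```

If the comparison is reversed, the loop keeps reading exactly when progress has already been made, and stops only when the source runs out.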
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes #13491
Closes #14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
(cherry picked from commit 586102b42e)
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes #12462
Closes #13906
(cherry picked from commit 9b0679c140)
This includes seastar update titled
'Merge 'Split rpc::server stop into two parts''
Includes backport of #12244 fix
* br-5.1-backport-ms-shutdown:
messaging_service: Shutdown rpc server on shutdown
messaging_service: Generalize stop_servers()
messaging_service: Restore indentation after previous patch
messaging_service: Coroutinize stop()
messaging_service: Coroutinize stop_servers()
messaging: Shutdown on stop() if it wasn't shut down earlier
Update seastar submodule
refs: #14031
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process
backport: The messaging_service::shutdown() had conflict due to missing
e147681d85 commit
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it do_with_servers() and make it accept method to call and message
to print. This gives the ability to reuse this helper in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes #14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14038
(cherry picked from commit 23443e0574)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14195
Make reader_context::lookup_readers() exception safe with regard to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784
Closes #13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.
Fixes #14139
Closes #14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes #13545.
Closes #14134
(cherry picked from commit f51312e580)
Fixes a regression introduced in 80917a1054:
"scylla_prepare: stop generating 'mode' value in perftune.yaml"
When cpuset.conf contains a "full" CPU set the negation of it from
the "full" CPU set is going to generate a zero mask as an irq_cpu_mask.
This is an illegal value that will eventually end up in the generated
perftune.yaml, which in turn will make the scylla service fail to start
until the issue is resolved.
In such a case an irq_cpu_mask must represent a "full" CPU set mimicking
a former 'MQ' mode.
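The mask computation described above can be illustrated as follows. This is a sketch only: the real logic lives in the Python setup scripts, and modelling CPU sets as 64-bit masks (bit i = CPU i) is an assumption made for the example.

```cpp
#include <cstdint>

// IRQ CPUs are all CPUs except those listed in cpuset.conf. When that
// leaves nothing (cpuset.conf covers every CPU), fall back to the full set
// instead of emitting an illegal zero mask.
std::uint64_t irq_cpu_mask(std::uint64_t all_cpus, std::uint64_t cpuset) {
    const std::uint64_t mask = all_cpus & ~cpuset;
    return mask != 0 ? mask : all_cpus;
}
```

For example, on a 2-vCPU machine where cpuset.conf lists both CPUs, `irq_cpu_mask(0x3, 0x3)` yields `0x3` rather than `0x0`.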
\Fixes scylladb/scylladb#11701
Tested:
- Manually on a 2 vCPU VM in an 'auto-selection' mode.
- Manually on a large VM (48 vCPUs) with an 'MQ' manually
enforced.
Message-Id: <20221004004237.2961246-1-vladz@scylladb.com>
(cherry picked from commit 8195dab92a)
This patch fixes the regression introduced by 3a51e78 which broke
a very important contract: perftune.yaml should not be "touched"
by Scylla scriptology unless explicitly requested.
And a call for scylla_cpuset_setup is such an explicit request.
The issue that the offending patch was intending to fix was that
cpuset.conf was always generated anew for every call of
scylla_cpuset_setup - even if a resulting cpuset.conf would come
out exactly the same as the one present on the disk before the call.
And since the original code was following the contract mentioned above
it was also deleting perftune.yaml every time too.
However, this was just an unavoidable side-effect of that cpuset.conf
re-generation.
The above also means that if scylla_cpuset_setup doesn't write to cpuset.conf
we should not "touch" perftune.yaml and vice versa.
This patch implements exactly that together with reverting the dangerous
logic introduced by 3a51e78.
\Fixes scylladb/scylladb#11385
\Fixes scylladb/scylladb#10121
(cherry picked from commit c538cc2372)
Modern perftune.py supports a more generic way of defining IRQ CPUs:
'irq_cpu_mask'.
This patch makes our auto-generation code create a perftune.yaml
that uses this new parameter instead of using outdated 'mode'.
As a side effect, this change eliminates the notion of "incorrect"
value in cpuset.conf - every value is valid now as long as it fits into
the 'all' CPU set of the specific machine.
Auto-generated 'irq_cpu_mask' is going to include all bits from 'all'
CPU mask except those defined in cpuset.conf.
\Fixes scylladb/scylladb#9903
(cherry picked from commit 80917a1054)
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations
For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch makes said class cheap to copy, by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).
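The layout described above might look roughly like this. It is a simplified sketch: `heavy_key` stands in for `dht::ring_position`, `key_or_view` is a hypothetical name, and the schema and the default constructor required by boost::icl are omitted.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a key that allocates when copied.
using heavy_key = std::string;

class key_or_view {
    std::shared_ptr<const heavy_key> _owned; // null when viewing an outside key
    const heavy_key* _view;                  // what comparisons actually read
public:
    // Owning: the key is stored once on the heap and shared by all copies.
    explicit key_or_view(heavy_key k)
        : _owned(std::make_shared<const heavy_key>(std::move(k)))
        , _view(_owned.get()) {}
    // Non-owning: used for lookups; the caller keeps `*k` alive.
    explicit key_or_view(const heavy_key* k) : _view(k) {}

    const heavy_key& key() const { return *_view; }
    bool owns() const { return static_cast<bool>(_owned); }
};

inline bool operator<(const key_or_view& a, const key_or_view& b) {
    return a.key() < b.key();
}
```

Copying such an object bumps a reference count and copies a raw pointer; the key itself is never duplicated.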
Fixes: #11669
Closes #11670
(cherry picked from commit 169a8a66f2)
The manager intended to periodically reevaluate compaction need for
each registered table. But it's not working as intended.
The reevaluation is one-off.
This means that compaction was not kicking in later for a table with
little to no write activity that had data expiring 1 hour from now.
Also make sure that reevaluation happens within the compaction
scheduling group.
Fixes #13430.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 156ac0a67a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Every 1 hour, compaction manager will submit all registered table_state
for a regular compaction attempt, all without yielding.
This can potentially cause a reactor stall if there are 1000s of table
states, as compaction strategy heuristics will run on behalf of each,
and processing all buckets and picking the best one is not cheap.
This problem can be magnified with compaction groups, as each group
is represented by a table state.
This might appear in dashboard as periodic stalls, every 1h, misleading
the investigator into believing that the problem is caused by a
chronological job.
This is fixed by piggybacking on compaction reevaluation loop which
can yield between each submission attempt if needed.
Fixes #12390.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12391
(cherry picked from commit 67ebd70e6e)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
postponed_compactions_reevaluation() runs until compaction_manager is
stopped, checking if it needs to launch new compactions.
Make it return a future instead of stashing its completion somewhere.
This makes it easier to convert it to a coroutine.
(cherry picked from commit d2c44cba77)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When off-strategy compaction completes, regular compaction is not triggered.
If off-strategy output causes the table's SSTable set to not conform to the strategy
goal, it means that read and space amplification will be suboptimal until the next
compaction kicks in, which can take an indefinite amount of time (e.g. when active
memtable is flushed).
Let's reevaluate compaction on main SSTable set when off-strategy ends.
Fixes #13429.
Backport note: conflict is around compaction_group vs table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2652b41606)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);
Fixes #13394.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 213eaab246)
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.
static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
Fixes #13395.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),
Fixes #13396.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit f8df3c72d4)
Variant used by
streaming/stream_transfer_task.cc: , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))
as full slice is retrieved after schema is moved (clang evaluates
left-to-right), the stream transfer task can be potentially working
on a stale slice for a particular set of partitions.
static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());
Fixes #13397.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 04932a66d3)
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
With gc_before set to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.
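The intended relationship between the two modes can be sketched like this. The names and the simplified `is_purgeable` check are assumptions for illustration, with `std::chrono::system_clock` standing in for the real `gc_clock`.

```cpp
#include <chrono>

using gc_clock = std::chrono::system_clock; // stand-in for the real gc_clock

enum class tombstone_gc_mode { timeout, immediate };

// Immediate mode behaves like timeout mode with a zero grace period,
// so gc_before is the query time rather than time_point::max().
gc_clock::time_point gc_before(tombstone_gc_mode mode,
                               gc_clock::time_point query_time,
                               std::chrono::seconds gc_grace) {
    switch (mode) {
    case tombstone_gc_mode::immediate:
        return query_time;
    case tombstone_gc_mode::timeout:
        return query_time - gc_grace;
    }
    return query_time;
}

// Simplified check: data may be purged only if it expired before gc_before.
bool is_purgeable(gc_clock::time_point expiry, gc_clock::time_point before) {
    return expiry < before;
}
```

With `time_point::max()` as the bound, any expiry time compares as earlier, which is how a row with an unexpired TTL could be dropped.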
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes #13572
Closes #13800
(cherry picked from commit 7fcc403122)
This is not really an error, so print it in debug log_level
rather than error log_level.
Fixes #13374
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13462
(cherry picked from commit cc42f00232)
Courtesy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});
The use-after-move is UB; whether it actually happens depends on evaluation order.
We haven't hit it yet as clang is left-to-right.
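One common way to avoid this class of bug is to read everything needed from the object in a separate statement before moving it. The sketch below uses hypothetical names (`entry_stub`, `insert_entry`) and a `std::map` in place of the cache's container.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a cache entry that carries its own key.
struct entry_stub {
    std::string key;
    int payload;
};

void insert_entry(std::map<std::string, entry_stub>& m, entry_stub e) {
    // Extract the key in its own statement, so the result no longer depends
    // on the order in which the call's arguments are evaluated.
    std::string key = e.key;
    m.emplace(std::move(key), std::move(e));
}
```

Writing `m.emplace(e.key, std::move(e))` would compile just as well, which is why such issues tend to be found by static analysis rather than by the compiler.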
Fixes #13400.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13401
(cherry picked from commit d2d151ae5b)
Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used.
Fixes: #12939
Closes #12963
* github.com:scylladb/scylladb:
test:boost: counter column parallelized aggregation test
service:forward_service: use long type when column is counter
(cherry picked from commit 61e67b865a)
Run tests for parallelized aggregation with
`enable_parallelized_aggregation` set always to true, so the tests work
even if the default value of the option is false.
Closes #12409
(cherry picked from commit 83bb77b8bb)
Ref #12939.
This patch fixes #12475, where an aggregation (e.g., COUNT(*), MIN(v))
of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()")
resulted in an internal error instead of the "zero" result that each
aggregator expects (e.g., 0 for COUNT, null for MIN).
The problem is that normally our aggregator forwarder picks the nodes
which hold the relevant partition(s), forwards the request to each of
them, and then combines these results. When there are no partitions,
the query is sent to no node, and we end up with an empty result set
instead of the "zero" results. So in this patch we recognize this
case and build those "zero" results (as mentioned above, these aren't
always 0 and depend on the aggregation function!).
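As an illustration of why the "zero" result depends on the aggregation function, here is a small sketch that combines per-node partial results. The names are hypothetical and only COUNT and MIN are modelled, with `std::nullopt` standing in for CQL null.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

enum class aggregate_fn { count, min };
using partial = std::optional<long>; // nullopt models a CQL null

// The result an aggregator yields when it has seen no input at all.
partial zero_result(aggregate_fn fn) {
    return fn == aggregate_fn::count ? partial(0) : std::nullopt;
}

// Combine per-node partial results. With no nodes contacted, the loop body
// never runs and the function-specific zero result is returned.
partial combine(aggregate_fn fn, const std::vector<partial>& per_node) {
    partial acc = zero_result(fn);
    for (const partial& p : per_node) {
        if (!p) {
            continue;
        }
        if (fn == aggregate_fn::count) {
            acc = *acc + *p;
        } else {
            acc = acc ? std::min(*acc, *p) : *p;
        }
    }
    return acc;
}
```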
The patch also adds two tests reproducing this issue in a fairly general
way (e.g., several aggregators, different aggregation functions) and
confirming the patch fixes the bug.
The test also includes two additional tests for COUNT aggregation, which
uncovered an incompatibility with Cassandra which is still not fixed -
so these tests are marked "xfail":
Refs #12477: Combining COUNT with GROUP BY results in empty results
in Cassandra, and one result with empty count in Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12715
(cherry picked from commit 3ba011c2be)
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491
Closes #13563
(cherry picked from commit 72003dc35c)
Undefined behavior because the evaluation order is undefined.
With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().
The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for a row, which accesses schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.
Fixes #13093.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13092
(cherry picked from commit 3fae46203d)
Fixes https://github.com/scylladb/scylladb/issues/13106
This commit removes the information that BYPASS CACHE
is an Enterprise-only feature and replaces that info
with the link to the BYPASS CACHE description.
Closes #13316
(cherry picked from commit 1cfea1f13c)
* tools/python3 bf6e892...4b04b46 (1):
> dist: redhat: provide only a single version
s/%{version}/%{version}-%{release}/ in `Requires:` sections.
this enforces the runtime dependencies of exactly the same
releases between scylla packages.
Fixes #13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 7165551fd7)
The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping
was flaky. It tests the toppartition request, which unfortunately needs
to choose a sampling duration in advance, and we chose 1 second which we
considered more than enough - and indeed typically even 1ms is enough!
but very rarely (we know of only one occurrence, in issue #13223) one
second is not enough.
Instead of increasing this 1 second and making this test even slower,
this patch takes a retry approach: The test starts with a 0.01 second
duration, and is then retried with increasing durations until it succeeds
or a 5-second duration is reached. This retry approach has two benefits:
1. It de-flakes the test (allowing a very slow test to take 5 seconds
instead of 1 second which wasn't enough), and 2. At the same time it
makes a successful test much faster (it used to always take a full
second, now it takes 0.07 seconds on a dev build on my laptop).
A *failed* test may, in some cases, take 10 seconds after this patch
(although in some other cases, an error will be caught immediately),
but I consider this acceptable - this test should pass, after all,
and a failure indicates a regression and taking 10 seconds will be
the last of our worries in that case.
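The retry approach can be sketched as below. This is not the test's code (the test itself is written in Python): `retry_with_growing_duration` is a hypothetical helper, and doubling the duration on every attempt is an assumption, since the growth factor is not stated above.

```cpp
#include <algorithm>
#include <functional>
#include <optional>

// Call `attempt` with growing durations until it succeeds or the maximum
// duration has been tried. Returns the duration that succeeded, if any.
std::optional<double> retry_with_growing_duration(
        const std::function<bool(double)>& attempt,
        double start = 0.01,
        double max_duration = 5.0) {
    for (double d = start;; d = std::min(d * 2, max_duration)) {
        if (attempt(d)) {
            return d;
        }
        if (d >= max_duration) {
            return std::nullopt; // even the longest duration failed
        }
    }
}
```

Under the doubling assumption, a run that never succeeds makes ten attempts whose durations sum to roughly 10 seconds, in line with the estimate above.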
Fixes #13223.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13238
(cherry picked from commit c550e681d7)
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.
The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.
Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.
Fixes #13239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13291
(cherry picked from commit 4fdcee8415)
sleep_abortable() is aborted on success, which causes sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if happens often.
Also, it spams the logs if --logger-log-level exception=trace is enabled.
Avoid by swallowing the exception on cancellation.
Fixes #13278.
Closes #13279
(cherry picked from commit 99cb948eac)
Otherwise the null pointer is dereferenced.
Add a unit test reproducing the issue
and testing this fix.
Fixes #13636
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12877ad026)
The removenode_abort logic that follows the warning
may throw, in which case information about
the original exception was lost.
Fixes: #11722
Closes #11735
(cherry picked from commit 40bd9137f8)
Related: https://github.com/scylladb/scylla-enterprise/issues/2807
This commit removes the --load-and-stream nodetool option
from version 5.1 - it is not supported in this version.
This commit should only be merged to branch-5.1 (not to master)
as the feature will be added in the later versions => in versions
prior to 5.2.x the information about the option is a bug.
Closes #13618
In `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
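The effect of the resolution mismatch can be reproduced with a small sketch. `tombstone_bound` and `is_cleared` are hypothetical helpers that model only the timestamp arithmetic, not the actual mutation code.

```cpp
#include <chrono>

using std::chrono::microseconds;
using std::chrono::milliseconds;

// Upper bound of the range tombstone, computed at the given resolution.
template <typename Resolution>
microseconds tombstone_bound(microseconds new_entry) {
    return std::chrono::duration_cast<microseconds>(
        std::chrono::floor<Resolution>(new_entry));
}

// An older entry is cleared only if it lies strictly below the bound.
bool is_cleared(microseconds old_entry, microseconds bound) {
    return old_entry < bound;
}
```

For two entries at 1000100us and 1000900us, the millisecond bound is truncated to 1000000us and the older entry survives, while the microsecond bound clears it.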
Fixes #13594
Closes #13604
* github.com:scylladb/scylladb:
db: system_keyspace: use microsecond resolution for group0_history range tombstone
utils: UUID_gen: accept decimicroseconds in min_time_UUID
(cherry picked from commit 10c1f1dc80)
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789
Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793
Closes #11795
* github.com:scylladb/scylladb:
distributed_loader: populate_column_family: reindent
distributed_loader: coroutinize populate_column_family
distributed_loader: table_population_metadata: start: reindent
distributed_loader: table_population_metadata: coroutinize start_subdir
distributed_loader: table_population_metadata: start_subdir: reindent
distributed_loader: pre-load all sstables metadata for table before populating it
(cherry picked from commit 4aa0b16852)
Our documentation states that writing an item with "USING TTL 0" means it
should never expire. This should be true even if the table has a default
TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no
USING TTL at all (i.e., it took the default TTL, instead of unlimited).
We had two xfailing tests demonstrating that Scylla's behavior in this
is different from Cassandra. Scylla's behavior in this case was also
undocumented.
By the way, Cassandra used to have the same bug (CASSANDRA-11207) but
it was fixed already in 2016 (Cassandra 3.6).
So in this patch we fix Scylla's "USING TTL 0" behavior to match the
documentation and Cassandra's behavior since 2016. One xfailing test
starts to pass and the second test gets past this bug and fails on a
different one. This patch also adds a third test for "USING TTL ?"
with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a
missing "USING TTL".
The origin of this bug was that after parsing the statement, we saved
the USING TTL in an integer, and used 0 for the case of no USING TTL
given. This meant that we couldn't tell if we have USING TTL 0 or
no USING TTL at all. This patch uses an std::optional so we can tell
the case of a missing USING TTL from the case of USING TTL 0.
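The distinction can be shown in a few lines. `effective_ttl` is a hypothetical helper written for illustration, not the function used by the CQL layer.

```cpp
#include <cstdint>
#include <optional>

// A TTL of 0 means "never expires". A disengaged optional means the
// statement had no USING TTL clause, so the table default applies.
std::int32_t effective_ttl(std::optional<std::int32_t> using_ttl,
                           std::int32_t table_default_ttl) {
    return using_ttl.value_or(table_default_ttl);
}
```

With a plain integer and 0 as the "not given" sentinel, `effective_ttl(0, 3600)` and `effective_ttl(std::nullopt, 3600)` could not be told apart.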
Fixes #6447
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13079
(cherry picked from commit a4a318f394)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retrieving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partition is interrupted
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engaged optional from `detach_state()`, because there is no state to
detach, the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, by overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_iteration::no` and `detach_state()` will return a disengaged
optional as it should in this case.
Fixes: #12629
Closes #13365
(cherry picked from commit bae62f899d)
The patch doesn't apply cleanly, so a targeted backport PR was necessary.
I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change.
Fixes: https://github.com/scylladb/scylladb/issues/11803
Closes #13516
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: don't evict inactive readers needlessly
reader_concurrency_semaphore: add stats to record reason for queueing permits
reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
reader_concurrency_semaphore: add set_resources()
The total disk space used metric incorrectly reports the amount of
disk space ever used, which is wrong. It should report the size of
all sstables being used + the ones waiting to be deleted.
Live disk space used, by this definition, shouldn't account for the
ones waiting to be deleted.
And live sstable count shouldn't account for sstables waiting to
be deleted.
Fix all that.
Fixes#12717.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 529a1239a9)
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.
This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
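To illustrate the size difference, here is a rough Python sketch. It assumes an "everywhere"-style topology in which every endpoint replicates every range, with one range per node for simplicity. The function names are hypothetical, and how the real linear variant of get_address_ranges() is implemented is not stated here.

```python
def all_pairs(endpoints, ranges):
    """Quadratic variant: materialise every <endpoint, range> pair."""
    return [(ep, r) for ep in endpoints for r in ranges]


def ranges_for(endpoint, endpoints, ranges):
    """Linear variant: only the ranges owned by a single endpoint."""
    # Assumption: every known endpoint replicates every range.
    return list(ranges) if endpoint in endpoints else []


endpoints = [f"node{i}" for i in range(100)]
ranges = list(range(100))  # one token range per node, for simplicity

pairs = all_pairs(endpoints, ranges)            # 100 * 100 entries
mine = ranges_for("node0", endpoints, ranges)   # 100 entries
```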
Refs #10337
Refs #10817
Refs #10836
Refs #10837
Fixes #12724
(cherry picked from commit 9e57b21e0c)
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated and requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit test is also added, reproducing the overly aggressive eviction and
checking that it doesn't happen anymore.
Fixes: #11803
Closes #13286
(cherry picked from commit bd57471e54)
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in new stats, one for each reason a permit
can be queued.
(cherry picked from commit 7b701ac52e)
Allow changing the total or initial resources the semaphore has.
After calling `set_resources()` the semaphore will look as if it
was created with the specified amount of resources.
(cherry picked from commit ecc7c72acd)
This is a backport of #11949
Closes #13303
* github.com:scylladb/scylladb:
transport server: fix "request size too large" handling
transport server: fix unexpected server errors handling
test/cql-pytest.py: add scylla_inject_error() utility
test/cql-pytest: add simple tests for USE statement
Fixes #12104
Calling _read_buf.close() doesn't imply eof(), some data
may have already been read into kernel or client buffers
and will be returned next time read() is called.
When the _server._max_request_size limit was exceeded
and the _read_buf was closed, the process_request method
finished and we started processing the next request in
connection::process. The unread data from _read_buf was
treated as the header of the next request frame, resulting
in "Invalid or unsupported protocol version" error.
The existing test_shed_too_large_request was adjusted.
It was originally written with the assumption that the data
of a large query would simply be dropped from the socket
and the connection could be used to handle the
next requests. This behaviour was changed in scylladb#8800,
now the connection is closed on the Scylla side and
can no longer be used. To check there are no errors
in this case, we use Scylla metrics, getting them
from the Scylla Prometheus API.
(cherry picked from commit 3263523)
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was a timeout error.
A new test_batch_with_error is added. It is quite
difficult to reproduce the error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.
Closes: scylladb#12104
(cherry picked from commit a4cf509)
This patch adds scylla_inject_error(), a context manager which tests
can use to temporarily enable some error injection while some test
code is running. It can be used to write tests that artificially
inject certain errors instead of trying to reach the elaborate (and
often requiring precise timing or high amounts of data) situation where
they occur naturally.
The error-injection API is Scylla-specific (it uses the Scylla REST API)
and does not work on "release"-mode builds (all other modes are supported),
so when Cassandra or a release-mode build is being tested, a test which
uses scylla_inject_error() gets skipped.
Example usage:
```python
from rest_api import scylla_inject_error
with scylla_inject_error(cql, "injection_name", one_shot=True):
    # do something here
    ...
```
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12264
(cherry picked from commit 6d2e146aa6)
This patch adds a couple of simple tests for the USE statement: that
without USE one cannot create a table without explicitly specifying
a keyspace name, and with USE, it is possible.
Beyond testing these specific features, this patch also serves as an
example of how to write more tests that need to control the effective USE
setting. Specifically, it adds a "new_cql" function that can be used to
create a new connection with a fresh USE setting. This is necessary
in such tests, because if multiple tests use the same cql fixture
and its single connection, they will share their USE setting and there
is no way to undo or reset it after being set.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11741
(cherry picked from commit ef0da14d6f)
We currently don't clean up the system_distributed.view_build_status
table after nodes are removed. This can cause a false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, and even when done, it might not be immediate. So until then,
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
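The filtering idea, sketched in Python under the assumption that the table rows can be modelled as a dict keyed by (view, host). The function name and the data shape are hypothetical and are not the actual C++ code.

```python
def filter_known_hosts(view_build_status, known_hosts):
    """Keep only build-status rows whose host is still part of the cluster."""
    return {
        (view, host): status
        for (view, host), status in view_build_status.items()
        if host in known_hosts
    }


status = {
    ("ks.view1", "host-a"): "SUCCESS",
    ("ks.view1", "host-removed"): "STARTED",  # stale row left by a removed node
}
filtered = filter_known_hosts(status, known_hosts={"host-a"})
```

With the stale row ignored, a check over `filtered` no longer sees an unfinished build that belongs to a node which no longer exists.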
Fixes: #11905
Refs: #11836
Closes #11860
(cherry picked from commit 84a69b6adb)
On some Docker instance configurations, hostname resolution does not
work, so our script will fail on startup because we use hostname -i to
construct cqlshrc.
To prevent the error, we can use --rpc-address or --listen-address
for the address since it should be the same.
Fixes #12011
Closes #12115
(cherry picked from commit 642d035067)
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.
Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.
Fixes #12098
(cherry picked from commit 1ef113691a)
This PR backports 2f4a793457 to branch-5.1. Said patch depends on some other patches that are not part of any release yet.
Closes #13224
* github.com:scylladb/scylladb:
reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
reader_permit: expose operator<<(reader_permit::state)
reader_permit: add get_state() accessor
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in several ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)
The list goes on. We already have an evict() method that does all this
correctly, use that instead of the current badly open-coded alternative.
This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.
Fixes: #13048
Closes #13049
(cherry picked from commit 2f4a793457)
This is another attempt to fix #13001 on `branch-5.1`.
In #13001 we found a test case which causes a crash on `branch-5.1` because it didn't handle `UNSET_VALUE` properly:
```python3
def test_unset_insert_where(cql, table2):
    p = unique_key_int()
    stmt = cql.prepare(f'INSERT INTO {table2} (p, c) VALUES ({p}, ?)')
    with pytest.raises(InvalidRequest, match="unset"):
        cql.execute(stmt, [UNSET_VALUE])

def test_unset_insert_where_lwt(cql, table2):
    p = unique_key_int()
    stmt = cql.prepare(f'INSERT INTO {table2} (p, c) VALUES ({p}, ?) IF NOT EXISTS')
    with pytest.raises(InvalidRequest, match="unset"):
        cql.execute(stmt, [UNSET_VALUE])
```
This problem has been fixed on `master` by PR #12517. I tried to backport it to `branch-5.1` (#13029), but this didn't go well - it was a big change that touched a lot of components. It's hard to make sure that it won't cause some unexpected issues.
Then I made a simpler fix for `branch-5.1`, which achieves the same effect as the original PR (#13057).
The problem is that this effect includes backwards incompatible changes - it bans UNSET_VALUE in some places that `branch-5.1` used to allow.
Breaking changes are bad, so I made this PR, which does an absolutely minimal change to fix the crash.
It adds a check the moment before the crash would happen.
To make sure that everything works correctly, and to detect any possible breaking changes, I wrote a bunch of tests that validate the current behavior.
I also ported some tests from the `master` branch, at least the ones that were in line with the behavior on `branch-5.1`.
Closes #13133
* github.com:scylladb/scylladb:
cql-pytest/test_unset: port some tests from master branch
cql-pytest/test_unset: test unset value in UPDATEs with LWT conditions
cql-pytest/test_unset: test unset value in UPDATEs with IF EXISTS
cql-pytest/test_unset: test unset value in UPDATE statements
cql-pytest/test_unset: test unset value in INSERTs with IF NOT EXISTS
cql-pytest/test_unset: test unset value in INSERT statements
cas_request: fix crash on unset value in primary key with LWT
I copied cql-pytest tests from the master branch,
at least the ones that were compatible with branch-5.1
Some of them were expecting an InvalidRequest exception
in case of UNSET VALUES being present in places that
branch-5.1 allows, so I skipped these tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add tests which test INSERT statements with IF NOT EXISTS,
when an UNSET_VALUE is passed for some column.
The tests are similar to the previous ones done for simple
INSERTs without IF NOT EXISTS.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add some tests which test what happens when an UNSET_VALUE
is passed to an INSERT statement.
Passing it for partition key column is impossible
because python driver doesn't allow it.
Passing it for clustering key column causes Scylla
to silently ignore the INSERT.
Passing it for a regular or static column
causes this column to remain unchanged,
as expected.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Doing an LWT INSERT/UPDATE and passing UNSET_VALUE
for the primary key column used to cause a crash.
This is a minimal fix for this crash.
Crash backtrace pointed to a place where
we tried doing .front() on an empty vector
of primary key ranges.
I added a check that the vector isn't empty.
If it's empty then let's throw an error
and mention that it's most likely
caused by an unset value.
This has been fixed on master,
but the PR that fixed it introduced
breaking changes, which I don't want
to add to branch-5.1.
This fix is absolutely minimal
- it performs the check at the
last moment before a crash.
It's not the prettiest, but it works
and can't introduce breaking changes,
because the new code gets activated
only in cases that would've caused
a crash.
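The shape of the check, as a Python sketch. The real fix is in C++ and guards a `.front()` call on a vector; `UnsetValueError` and `first_key_range` are hypothetical names, and the error message is only an illustration.

```python
class UnsetValueError(Exception):
    """Hypothetical stand-in for the error reported to the client."""


def first_key_range(key_ranges):
    # Guard the access that used to crash: taking the first element
    # of an empty collection of primary key ranges.
    if not key_ranges:
        raise UnsetValueError(
            "primary key ranges are empty, most likely caused by an unset value"
        )
    return key_ranges[0]
```

Because the new branch is taken only when the collection is empty, every input that worked before behaves exactly as it did.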
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.
`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.
Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.
Fixes: #13055
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13056
(cherry picked from commit aa604bd935)
EOF is only guaranteed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.
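The same idea, sketched in Python as an analogy. Python streams have no EOF flag, so the sketch checks directly that one more read yields 0 bytes. `fully_consumed` is a hypothetical helper and is not part of `validate_checksums()`.

```python
import io


def fully_consumed(stream, expected_size):
    """Return True if the stream holds exactly `expected_size` bytes."""
    data = stream.read(expected_size)
    # Reading exactly `expected_size` bytes does not prove that the stream
    # has ended; try to read more and check that nothing comes back.
    extra = stream.read(1)
    return len(data) == expected_size and len(extra) == 0


exact = fully_consumed(io.BytesIO(b"abcd"), 4)
trailing = fully_consumed(io.BytesIO(b"abcde"), 4)
```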
Fixes: #11190
Closes #12696
(cherry picked from commit 693c22595a)
Currently, UDAs can't be reused if Scylla has been
restarted since they have been created. This is
caused by the missing initialization of saved
UDAs that should have inserted them to the
cql3::functions::functions::_declared map, that
should store all (user-)created functions and
aggregates.
This patch adds the missing implementation in a way
that's analogous to the method of inserting UDF to
the _declared map.
Fixes #11309
(cherry picked from commit e558c7d988)
The reason is an alloc-dealloc mismatch of position_in_partition objects
allocated by cursors inside the coroutine object stored in the update
variable in row_cache::do_update().
It is allocated under the cache region, but in case of an exception it will
be destroyed under the standard allocator. If the update is successful, it
will be cleared under the region allocator, so there is no problem in the
normal case.
Fixes #12068
Closes #12233
(cherry picked from commit 992a73a861)
This commit makes the following changes to the docs landing page:
- Adds the ScyllaDB enterprise docs as one of three tiles.
- Modifies the three tiles to reflect the three flavors of ScyllaDB.
- Moves the "New to ScyllaDB? Start here!" under the page title.
- Renames "Our Products" to "Other Products" to list the products other
than ScyllaDB itself. In addition, the boxes are enlarged to
large-4 to look better.
The major purpose of this commit is to expose the ScyllaDB
documentation.
docs: fix the link
(cherry picked from commit 27bb8c2302)
Closes #13086
The Azure metadata API may sometimes return an empty zone. If that happens,
shard-0 gets an empty string as its rack, but propagates UNKNOWN_RACK to
other shards.
An empty zone response should be handled regardless.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12274
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several snitch drivers make HTTP requests to get
region/dc/zone/rack/whatever from the cloud provider. They blindly rely
on the response being successful and read the response body to parse the
data they need from it.
That's not nice, add checks that requests finish with an HTTP OK status.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12287
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.
Fixes: #12821
Fixes: #12823
Fixes: #12708
Closes #12824
(cherry picked from commit ef548e654d)
we should never return a reference to local variable.
so in this change, a reference to a static variable is returned
instead. this should address following warning from Clang 17:
```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
return {};
^~
```
Fixes #12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12876
(cherry picked from commit 6eab8720c4)
Currently they are upgraded during learn on a replica. There are two
problems with this. First, the column mapping may not exist on a replica
if it missed this particular schema (because it was down, for instance)
and the mapping history is not part of the schema. In this case "Failed
to look up column mapping for schema version" will be thrown. Second, the
LWT request coordinator may not have the schema for the mutation either
(because it was already freed from the registry), and when a replica
tries to retrieve the schema from the coordinator, the retrieval will fail,
causing the whole request to fail with "Schema version XXXX not found".
Both of those problems can be fixed by upgrading stored mutations
during prepare on the node they are stored at. To upgrade a mutation, its
column mapping is needed, and it is guaranteed to be present
at the node the mutation is stored at, since it is a prerequisite for
storing it that the corresponding schema is available. After that, the
mutation is processed using the latest schema, which will be available on
all nodes.
Fixes #10770
Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>
(cherry picked from commit 15ebd59071)
trim_clustering_row_ranges_to() is broken for non-full keys in reverse
mode. It will trim the range to
position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.
Fixes #12180
Refs #1446
(cherry picked from commit 536c0ab194)
A frozen set can be part of the clustering key, and with compact
storage, the corresponding key component can have an empty value.
Comparison was not prepared for this, the iterator attempts to
deserialize the item count and will fail if the value is empty.
Fixes #12242
(cherry picked from commit 232ce699ab)
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and for YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
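A minimal Python sketch of the intended YAML-side behaviour, assuming the option spec always puts the long name first. `yaml_key` and `lookup` are hypothetical helpers, not the actual config parser.

```python
def yaml_key(option_name):
    """Return the long name from a "long_name,short_name" option spec."""
    return option_name.split(",", 1)[0]


def lookup(config, option_name):
    """Look an option up in a parsed YAML mapping by its long name."""
    return config.get(yaml_key(option_name))


config = {"workdir": "/var/lib/scylla"}
value = lookup(config, "workdir,W")
```

Before the fix, the lookup effectively used the full "workdir,W" string as the key, which no sensible YAML file contains.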
Fixes #7478
Fixes #9500
Fixes #11503
Closes #11506
(cherry picked from commit af7ace3926)
We currently configure only TimeoutStartSec, but it's probably not
enough to prevent the coredump timeout, since TimeoutStartSec is the maximum
waiting time for service startup, and there is another directive to
specify the maximum service running time (RuntimeMaxSec).
To fix the problem, we should specify RuntimeMaxSec and TimeoutSec (it
configures both TimeoutStartSec and TimeoutStopSec).
Fixes #5430
Closes #12757
(cherry picked from commit bf27fdeaa2)
Related https://github.com/scylladb/scylladb/issues/12658.
This issue fixes the bug in the upgrade guides for the released versions.
Closes #12679
* github.com:scylladb/scylladb:
doc: fix the service name in the upgrade guide for patch releases versions 2022
doc: fix the service name in the upgrade guide from 2021.1 to 2022.1
(cherry picked from commit 325246ab2a)
Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall, on a special event like schema change.
A simple conflict was resolved in the first patch, since master has compaction groups. It was very easy to resolve.
Regression since 1d9f53c881, which is present in 5.1 onwards. So probably it merits a backport to 5.2 too.
Closes #12769
* github.com:scylladb/scylladb:
compaction: Fix inefficiency when updating LCS backlog tracker
table: Fix quadratic behavior when inserting sstables into tracker on schema change
LCS backlog tracker uses STCS tracker for L0. Turns out LCS tracker
is calling STCS tracker's replace_sstables() with empty arguments
even when higher levels (> 0) *only* had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.
As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.
Inefficiency is fixed by only updating the STCS tracker if any
L0 sstable is being added or removed from the table.
This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantially higher number of sstables,
therefore updating the STCS tracker only when level 0 changes significantly
reduces the number of times the L0 backlog is recomputed.
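The guard can be sketched in Python like this. It is a toy model, not the C++ trackers: sstables are modelled as hypothetical (name, level) tuples, and the STCS tracker only counts how often it is asked to recompute.

```python
class StcsTracker:
    """Stand-in for the L0 tracker; counts backlog recomputations."""

    def __init__(self):
        self.recomputations = 0

    def replace_sstables(self, old, new):
        self.recomputations += 1


class LcsTracker:
    def __init__(self):
        self.l0 = StcsTracker()

    def replace_sstables(self, old, new):
        # Each sstable is modelled as a (name, level) tuple.
        l0_old = [s for s in old if s[1] == 0]
        l0_new = [s for s in new if s[1] == 0]
        # Only touch the STCS tracker if level 0 actually changed.
        if l0_old or l0_new:
            self.l0.replace_sstables(l0_old, l0_new)


tracker = LcsTracker()
tracker.replace_sstables([("a", 2)], [("b", 2)])  # higher level only
higher_level_only = tracker.l0.recomputations
tracker.replace_sstables([], [("c", 0)])          # L0 sstable added
after_l0_change = tracker.l0.recomputations
```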
Refs #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12676
(cherry picked from commit 1b2140e416)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Each time the backlog tracker is informed about a new or old sstable, it
will recompute the static part of the backlog, whose complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.
Fixes #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12593
(cherry picked from commit 87ee547120)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Convert decompressed temporary buffers into tracked buffers just before
returning them to the upper layer. This ensures these buffers are known
to the reader concurrency semaphore and it has an accurate view of the
actual memory consumption of reads.
Fixes: #12448
Closes #12454
(cherry picked from commit c4688563e3)
Consider the following MVCC state of a partition:
v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
Where === means a continuous range and --- means a discontinuous range.
After two LRU items are evicted (entry1 and entry2), we will end up with:
v2: ---------------------- <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.
This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:
v2: ---------------------- <9> ===== <last dummy>
...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.
The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.
The situation is not easy to trigger because it requires certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, so across memtable flushes.
Closes #12452
(cherry-picked from commit f97268d8f2)
Fixes #12451.
LCS reshape is compacting all levels if a single one breaks
disjointness. That's unnecessary work because rewriting that single
level is enough to restore disjointness. If multiple levels break
disjointness, each will be reshaped in its own iteration, thereby
reducing the operation time of each step and the disk space requirement,
as input files can be released incrementally.
Incremental compaction is not applied to reshape yet, so we need to
avoid "major compaction", to avoid the space overhead.
But space overhead is not the only problem: the inefficiency in
deciding what to reshape when overlapping is detected motivated
this patch.
Fixes #12495.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12496
(cherry picked from commit f2f839b9cc)
Currently, we create `forward_aggregates` inside a function that
returns the result of a future lambda that captures these aggregates
by reference. As a result, the aggregates may be destructed before
the lambda finishes, resulting in a heap use-after-free.
To prolong the lifetime of these aggregates, we cannot use a move
capture, because the lambda is wrapped in a with_thread_if_needed()
call on these aggregates. Instead, we fix this by wrapping the
entire return statement in a do_with().
Fixes #12528
Closes #12533
(cherry picked from commit 5f45b32bfa)
Currently reverse types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema, which has a reversed<frozen<UDT>> clustering key component,
will have this component incorrectly represented in the schema cql dump:
the UDT will lose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail, as only frozen UDTs are
allowed in primary key components.
Fixes: #12576
Closes #12579
(cherry picked from commit ebc100f74f)
Fixes #12601 (maybe?)
Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
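A sketch of paging over a sorted set, in Python. `list_tables_page` and its parameters are hypothetical, and the real listing code is C++ and keyed on table IDs. The point is only that a stable order plus an exclusive start key cannot return the same entry twice.

```python
def list_tables_page(table_ids, limit, exclusive_start=None):
    """Return one page of IDs and the key to resume from (None if done)."""
    ordered = sorted(table_ids)
    if exclusive_start is not None:
        ordered = [t for t in ordered if t > exclusive_start]
    page = ordered[:limit]
    last = page[-1] if len(ordered) > limit else None
    return page, last


ids = {"c", "a", "d", "b"}
first_page, resume_key = list_tables_page(ids, limit=2)
second_page, end_key = list_tables_page(ids, limit=2, exclusive_start=resume_key)
```

As noted above, an entry added between the two calls with an ID smaller than `resume_key` would be missed, which is the accepted trade-off.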
(cherry picked from commit da8adb4d26)
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
before this change, we construct a sstring from a comma statement,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.
in this change
* instead of passing the size of the string_view,
both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
performance, as we don't need to perform a deep copy
the issue is reported by GCC-13:
```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
^~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12666
(cherry picked from commit 186ceea009)
Fixes #12739.
(cherry picked from commit b588b19620)
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space is not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
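The ordering problem can be reproduced with a toy model in Python. `Segment`, `delete_buggy` and `delete_fixed` are hypothetical names, and capturing the size before removal is just one way to avoid the problem. It is not necessarily how the actual patch does it.

```python
class Segment:
    def __init__(self, size):
        self.known_size = size

    def remove_file(self):
        # As described above: removing the file resets known_size to 0.
        self.known_size = 0


def delete_buggy(total_size_on_disk, segment):
    segment.remove_file()
    return total_size_on_disk - segment.known_size  # subtracts 0


def delete_fixed(total_size_on_disk, segment):
    size = segment.known_size  # capture the size before it is reset
    segment.remove_file()
    return total_size_on_disk - size


buggy_total = delete_buggy(100, Segment(32))
fixed_total = delete_fixed(100, Segment(32))
```

In the buggy variant the total never decreases, which matches the infinitely rising metric mentioned below.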
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes #12645
Closes #12646
(cherry picked from commit fa7e904cd6)
Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.
(cherry picked from commit 1cc68b262e)
Cherry-pick note: the original commit added a lot of new stuff like
describing the Raft upgrade procedure, but also fixed problems with the
existing documentation. In this backport we include only the latter.
Closes#12582
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.
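Why the comparison is meaningless can be shown with a toy model in Python. It assumes, purely for illustration, that a steady clock counts seconds since its node booted and that wall clocks are synchronised across nodes. The `Node` class and all numbers are hypothetical.

```python
class Node:
    """Toy node with a node-local steady clock and a shared system clock."""

    def __init__(self, boot_wall_time):
        self.boot_wall_time = boot_wall_time

    def steady_now(self, wall_time):
        # Steady clock: the epoch is local to the node (here: its boot time).
        return wall_time - self.boot_wall_time

    def system_now(self, wall_time):
        # System clock: the epoch is shared by all nodes.
        return wall_time


coordinator = Node(boot_wall_time=1000)
replica = Node(boot_wall_time=5000)
wall = 6000   # current wall-clock time, in seconds
timeout = 10

steady_deadline = coordinator.steady_now(wall) + timeout
system_deadline = coordinator.system_now(wall) + timeout

# Time the replica believes is left before the request times out.
wrong_remaining = steady_deadline - replica.steady_now(wall)
right_remaining = system_deadline - replica.system_now(wall)
```

With these numbers the replica would wait far longer than the intended 10 seconds. With the boot times swapped, the deadline would already appear to be in the past.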
To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
Representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
- using steady_clock is just broken, so we aren't taking anything
from users by breaking it further
- once all nodes are upgraded, it magically starts to work
Closes #12529
(cherry picked from commit bbbe12af43)
Fixes #12458
This is a backport of 9fa1783892 (#11902) to branch-5.1
Flush the memtable before cleaning up the table so as not to leave any disowned tokens in the memtable,
as they might be resurrected if left in the memtable.
Refs #1239
Closes #12490
* github.com:scylladb/scylladb:
table: perform_cleanup_compaction: flush memtable
table: add perform_cleanup_compaction
api: storage_service: add logging for compaction operations et al
We don't explicitly clean up the memtable, while
it might hold tokens disowned by the current node.
Flush the memtable before performing cleanup compaction
to make sure all tokens in the memtable are cleaned up.
Note that non-owned ranges are invalidated in the cache
in compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.
Fixes #1239
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit eb3a94e2bc)
Move the integration with compaction_manager
from the api layer to the table class so
it can also make sure the memtable is cleaned up in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit fc278be6c4)
The line modified in this patch was supposed to increase the
optimization levels of parsers in debug mode to 1, because they
were too slow otherwise. But as a side effect, it also reduced the
optimization level in release mode to 1. This is not a problem
for the CQL frontend, because statement preparation is not
performance-sensitive, but it is a serious performance problem
for Alternator, where it lies in the hot path.
Fix this by only applying the -O1 to debug modes.
Fixes #12463
Closes #12460
(cherry picked from commit 08b3a9c786)
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batches of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafted large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: To detect when build_some() is done it is no
longer enough to check if it returned zero view-update rows; Rather,
it explicitly returns whether or not it is done as an std::optional.
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
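The termination bug and its fix can be illustrated with a small model. This is a sketch in Python rather than the real C++ code: `make_builder`, `drain_old`, and `drain_new` are hypothetical names, and the explicit `done` flag stands in for the std::optional result mentioned above.

```python
from typing import Callable, List, Tuple

MAX_ROWS_FOR_VIEW_UPDATES = 100  # batch size named in the commit message


def make_builder(rows: List[int], needs_view_update: Callable[[int], bool]):
    """Return a build_some()-like function that consumes `rows` in batches."""
    it = iter(rows)

    def build_some() -> Tuple[List[int], bool]:
        updates = []
        consumed = 0
        for row in it:
            consumed += 1
            if needs_view_update(row):
                updates.append(row)
            if consumed == MAX_ROWS_FOR_VIEW_UPDATES:
                return updates, False  # batch full, more input may follow
        return updates, True  # input exhausted

    return build_some


def drain_old(build_some) -> List[int]:
    """Buggy loop: treats an empty batch as the end of the input."""
    out = []
    while True:
        updates, _done = build_some()
        if not updates:
            return out
        out.extend(updates)


def drain_new(build_some) -> List[int]:
    """Fixed loop: stops only when the builder reports it is done."""
    out = []
    while True:
        updates, done = build_some()
        out.extend(updates)
        if done:
            return out
```

If the first 100 rows all skip view updates, `drain_old` returns nothing while `drain_new` still produces the updates for the remaining rows.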
Fixes #12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12305
(cherry picked from commit 92d03be37b)
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism, however, had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added without the fiber acting on them. Since it still held onto the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released in the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds detection of the inactive/waiter contradiction.
Fixes: #11923
Refs: #11770
Closes #12026
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
(cherry picked from commit 15ee8cfc05)
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods which admits permits on user
request (the front door) and the maybe_admit_waiters() which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged over time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.
Fixes: #11770
Closes #11784
(cherry picked from commit 7fbad8de87)
The --online-discard option is parsed as a string parameter because it
does not specify "action=", yet its default value is a boolean
(default=True). This breaks "provisioning in a similar environment",
because that code assumes a boolean option declared with
"action='store_true'", which this one is not.
Change the type of the option to int and specify "choices=[0, 1]",
just like --io-setup does.
Fixes #11700
Closes #11831
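The --online-discard parsing problem can be reproduced with plain argparse. The sketch below is not the actual setup script: `make_parser` and the `--online-discard-old` option name are hypothetical, and the new default of 1 is an assumption chosen to mirror the old default=True.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Without type= or action=, argparse keeps the raw string, so
    # "--online-discard-old 0" yields "0", which is truthy.
    parser.add_argument("--online-discard-old", default=True)
    # The change described above: parse as int and restrict to 0 or 1.
    parser.add_argument("--online-discard", type=int, choices=[0, 1], default=1)
    return parser


args = make_parser().parse_args(["--online-discard-old", "0", "--online-discard", "0"])
```

Here `args.online_discard_old` is the string "0" and evaluates as true, while `args.online_discard` is the integer 0.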
(cherry picked from commit acc408c976)
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check; this patch
adds it.
Fixes: #12060
(cherry picked from commit 7730c4718e)
When the mutation compactor has all the rows it needs for a page, it
saves the decision to stop in a member flag: _stop.
For single partition queries, the mutation compactor is kept alive
across pages and so it has a method, start_new_page() to reset its state
for the next page. This method didn't clear the _stop flag. This meant
that the value set at the end of the previous page could cause the new page
and subsequently the entire query to be stopped prematurely.
This can happen if the new page starts with a row that is covered by a
higher level tombstone and is completely empty after compaction.
Reset the _stop flag in start_new_page() to prevent this.
This commit also adds a unit test which reproduces the bug.
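The stale-flag scenario can be modelled in a few lines. This is a toy Python sketch, not the real compactor: `PageCompactor`, `read_pages`, and the `reset_stop` switch are hypothetical, and a row is reduced to a name plus a flag saying whether it survives compaction.

```python
class PageCompactor:
    """Toy stand-in for a compactor that is kept alive across pages."""

    def __init__(self, page_size: int, reset_stop: bool):
        self.page_size = page_size
        self.reset_stop = reset_stop  # True models the fixed behaviour
        self.stop = False
        self.rows_in_page = 0

    def start_new_page(self) -> None:
        self.rows_in_page = 0
        if self.reset_stop:
            self.stop = False

    def consume(self, live: bool) -> bool:
        """Consume one row; return True when the page should end."""
        if live:
            self.rows_in_page += 1
            self.stop = self.rows_in_page >= self.page_size
        # A row that is empty after compaction leaves the flag untouched.
        return self.stop


def read_pages(rows, page_size: int, reset_stop: bool):
    compactor = PageCompactor(page_size, reset_stop)
    pages = []
    i = 0
    while i < len(rows):
        compactor.start_new_page()
        page = []
        while i < len(rows):
            name, live = rows[i]
            i += 1
            if live:
                page.append(name)
            if compactor.consume(live):
                break
        if not page:
            break  # an empty page ends the query
        pages.append(page)
    return pages
```

When the second page begins with a row that is empty after compaction, the model without the reset ends the query after the first page, while the model with the reset returns both pages.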
Fixes: #12361
Closes #12384
(cherry picked from commit b0d95948e1)
This series backports several patches which add or enable tests for Alternator TTL. The series does not touch the code - just tests.
The goal of backporting more tests is to get the code - which is already in branch 5.1 - tested. It wasn't a good idea to backport code without backporting the tests for it.
Closes #12200
Fixes #11374
* github.com:scylladb/scylladb:
test/alternator: increase timeout on TTL tests
test/alternator: fix timeout in flaky test test_ttl_stats
test/alternator: test Alternator TTL metrics
test/alternator: skip fewer Alternator TTL tests
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound exists. This is a source of reactor stalls.
Fix that.
Fixes #12271
Closes #12364
(cherry picked from commit d9269abf5b)
Fix https://github.com/scylladb/scylla-doc-issues/issues/816
Fix https://github.com/scylladb/scylla-docs/issues/1613
This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487).
To make sure the same CQL version is used across the docs, we should use the `|cql-version|` variable rather than hardcode the version number on several pages.
The variable is specified in the conf.py file:
```
rst_prolog = """
.. |cql-version| replace:: 3.3.1
"""
```
Closes #11320
* github.com:scylladb/scylladb:
doc: add the Cassandra version on which the tools are based
doc: fix the version number
doc: update the Enterprise version where the ME format was introduced
doc: add the ME format to the Cassandar Compatibility page
doc: replace Scylla with ScyllaDB
doc: rewrite the Interfaces table to the new format to include more information about CQL support
doc: remove the CQL version from pages other than Cassandra compatibility
doc: fix the CQL version in the Interfaces table
(cherry picked from commit ee606a5d52)
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:
1. lock_pk acquires an exclusive lock on the partition.
2.a lock_ck attempts to acquire shared lock on the partition
and any lock on the row. both cases currently use a fiber
returning a future<rwlock::holder>.
2.b since the partition is locked, the lock_partition times out
returning an exceptional future. lock_row has no such problem
and succeeds, returning a future holding a rwlock::holder,
pointing to the row lock.
3.a the lock_holder previously returned by lock_pk is destroyed,
calling `row_locker::unlock`
3.b row_locker::unlock sees that the partition is not locked
and erases it, including the row locks it contains.
4.a when_all_succeeds continuation in lock_ck runs. Since
the lock_partition future failed, it destroys both futures.
4.b the lock_row future is destroyed with the rwlock::holder value.
4.c ~holder attempts to return the semaphore units to the row rwlock,
but the latter was already destroyed in 3.b above.
Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above.
This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.
This way, erasing the unlocked partition is never expected
to happen while any of its row locks is held.
Fixes #12168
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12208
(cherry picked from commit 5007ded2c1)
This PR adds the link to the KB article about updating the mode after the upgrade to the 5.1 upgrade guide.
In addition, I have:
- updated the KB article to include the versions affected by that change.
- fixed the broken link to the page about metric updates (it is not related to the KB article, but I fixed it in the same PR to limit the number of PRs that need to be backported).
Related: https://github.com/scylladb/scylladb/pull/11122
Closes #12148
* github.com:scylladb/scylladb:
doc: update the releases in the KB about updating the mode after upgrade
doc: fix the broken link in the 5.1 upgrade guide
doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide
(cherry picked from commit 897b501ba3)
This is a backport of https://github.com/scylladb/scylladb/pull/11783.
Closes #12229
* github.com:scylladb/scylladb:
doc: replace Scylla with ScyllaDB
doc: add a comment to remove in future versions any information that refers to previous releases
doc: rewrite the notes to improve clarity
doc: remove the reperitions from the notes
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes #11288.
Closes #11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
(cherry picked from commit 83850e247a)
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by dropping any remaining entry waiters (if
the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes #11235.
Closes #11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
(cherry picked from commit 9c4e32d2e2)
Contains fixes requested in the issue (and some tiny extras), together with analysis why they don't affect the users (see commit messages).
Fixes [#11800](https://github.com/scylladb/scylladb/issues/11800)
Closes #11926
* github.com:scylladb/scylladb:
alternator: add maybe_quote to secondary indexes 'where' condition
test/alternator: correct xfail reason for test_gsi_backfill_empty_string
test/alternator: correct indentation in test_lsi_describe
alternator: fix wrong 'where' condition for GSI range key
(cherry picked from commit ce7c1a6c52)
The SELECT JSON statement, just like SELECT, allows the user to rename
selected columns using an "AS" specification. E.g., "SELECT JSON v AS foo".
This specification was not honored: We simply forgot to look at the
alias in SELECT JSON's implementation (we did it correctly in regular
SELECT). So this patch fixes this bug.
We had two tests in cassandra_tests/validation/entities/json_test.py
that reproduced this bug. The checks in those tests now pass, but these
two tests still continue to fail after this patch because of two other
unrelated bugs that were discovered by the same tests. So in this patch
I also add a new test just for this specific issue - to serve as a
regression test.
Fixes #8078
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12123
(cherry picked from commit c5121cf273)
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch, and passes afterwards. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
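The caching problem can be shown with a small model. This Python sketch is not the real view_info class: `ViewInfo`, its constructor arguments, and the generated statement text are hypothetical, and only the idea of dropping the lazily computed statement in set_base_info() comes from the description above.

```python
class ViewInfo:
    """Toy model of a view that lazily caches its base-table SELECT."""

    def __init__(self, base_columns, invalidate_on_base_change=True):
        self._base_columns = list(base_columns)
        self._invalidate = invalidate_on_base_change  # False models the old bug
        self._select_statement = None  # lazily computed cache

    def select_statement(self) -> str:
        if self._select_statement is None:
            columns = ", ".join(self._base_columns)
            self._select_statement = f"SELECT {columns} FROM base"
        return self._select_statement

    def set_base_info(self, base_columns) -> None:
        self._base_columns = list(base_columns)
        if self._invalidate:
            # Forget the cached statement so it is rebuilt from the new schema.
            self._select_statement = None
```

Without the invalidation, the statement built before the schema change keeps being returned after set_base_info() is called.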
Fixes #10026
Fixes #11542
(cherry picked from commit 2f2f01b045)
Some of the tests in test/alternator/test_ttl.py need an expiration scan
pass to complete and expire items. In development builds on developer
machines, this usually takes less than a second (our scanning period is
set to half a second). However, in debug builds on Jenkins each scan
often takes up to 100 (!) seconds (this is the record we've seen so far).
This is why we set the tests' timeout to 120.
But recently we saw another test run failing. I think the problem is
that in some cases, we need not one, but *two* scanning passes to
complete before the timeout: It is possible that the test writes an
item right after the current scan passed it, so it doesn't get expired,
and then we start a second scan at a random position, possibly making that
item we mention one of the last items to be considered - so in total
we need to wait for two scanning periods, not one, for the item to
expire.
So this patch increases the timeout from 120 seconds to 240 seconds -
more than twice the highest scanning time we ever saw (100 seconds).
Note that this timeout is just a timeout, it's not the typical test
run time: The test can finish much more quickly, as little as one
second, if items expire quickly on a fast build and machine.
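The distinction between the timeout and the typical run time can be seen in a generic polling helper. This is a minimal sketch, not the helper used by the Alternator tests: `wait_until` and its parameters are hypothetical. The sleep between retries is a fixed short interval rather than a function of the timeout.

```python
import time


def wait_until(predicate, timeout: float, sleep: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True  # finishes as soon as the condition holds
        if time.monotonic() >= deadline:
            return False  # the timeout is only an upper bound
        time.sleep(sleep)
```

A call such as `wait_until(item_expired, timeout=240)` returns as soon as the item is gone, so raising the timeout does not slow down runs on fast machines; `item_expired` here stands for whatever check the test performs.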
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12106
(cherry picked from commit 6bc3075bbd)
The test `test_metrics.py::test_ttl_stats` tests the metrics associated
with Alternator TTL expiration events. It normally finishes in less than a
second (the TTL scanning is configured to run every 0.5 seconds), so we
arbitrarily set a 60 second timeout for this test to allow for extremely
slow test machines. But in some extreme cases even this was not enough -
in one case we measured the TTL scan to take 63 seconds.
So in this patch we increase the timeout in this test from 60 seconds
to 120 seconds. We already did the same change in other Alternator TTL
tests in the past - in commit 746c4bd.
Fixes #11695
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11696
(cherry picked from commit 3a30fbd56c)
This patch adds a test for the metrics generated by the background
expiration thread run for Alternator's TTL feature.
We test three of the four metrics: scylla_expiration_scan_passes,
scylla_expiration_scan_table and scylla_expiration_items_deleted.
The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the
number of times that this node took over another node's expiration duty,
so it requires a multi-node cluster to test, and we can't test it in the
single-node cluster test framework.
To see TTL expiration in action this test may need to wait up to the
setting of alternator_ttl_period_in_seconds. For a setting of 1
second (the default set by test/alternator/run), this means this
test can take up to 1 second to run. If alternator_ttl_period_in_seconds
is set higher, the test is skipped unless --runveryslow is requested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 297109f6ee)
Most of the Alternator TTL tests are extremely slow on DynamoDB because
item expiration may be delayed up to 24 hours (!), and in practice for
10 to 30 minutes. Because of this, we marked most of these tests
with the "veryslow" mark, causing them to be skipped by default - unless
pytest is given the "--runveryslow" option.
The result was that the TTL tests were not run in the normal test runs,
which can allow regressions to be introduced (luckily, this hasn't happened).
However, this "veryslow" mark was excessive. Many of the tests are very
slow only on DynamoDB, but aren't very slow on Scylla. In particular,
many of the tests involve waiting for an item to expire, something that
happens after the configurable alternator_ttl_period_in_seconds, which
is just one second in our tests.
So in this patch, we remove the "veryslow" mark from 6 of the Alternator TTL
tests, and instead use two new fixtures - waits_for_expiration and
veryslow_on_aws - to only skip the test when running on DynamoDB or
when alternator_ttl_period_in_seconds is high - but in our usual test
environment they will not get skipped.
Because 5 of these 6 tests wait for an item to expire, they take one
second each and this patch adds 5 seconds to the Alternator test
runtime. This is unfortunate (it's more than 25% of the total Alternator
test runtime!) but not a disaster, and we plan to reduce this 5 second
time further in the following patch, by decreasing the TTL scanning
period even further.
This patch also increases the timeout of several of these tests, to 120
seconds from the previous 10 seconds. As mentioned above, normally,
these tests should always finish in alternator_ttl_period_in_seconds
(1 second) with a single scan taking less than 0.2 seconds, but in
extreme cases of debug builds on overloaded test machines, we saw even
60 seconds being exceeded, so let's increase the maximum. I also needed
to make the sleep time between retries smaller, not a function of the
new (unrealistic) timeout.
4 more tests remain "veryslow" (and won't run by default) because they
take 5-10 seconds each (e.g., a test which waits to see that an item
does *not* get expired, and a test involving writing a lot of data).
We should reconsider this in the future - to perhaps run these tests in
our normal test runs - but even for now, the 6 extra tests that we
start running are a much better protection against regressions than what
we had until now.
Fixes #11374
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 746c4bd9eb)
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.
It's expected behaviour of
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates rhs
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was modification_statement::execute_with_condition
was not expecting an empty clustering_range in case of
a null clustering key.
Fixes: #11954
(cherry picked from commit 0d443dfd16)
According to seastar/doc/lambda-coroutine-fiasco.md, a lambda that
co_awaits once loses its capture frame. In the distributed_loader
code there's at least one of that kind.
fixes: #12175
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12170
(cherry picked from commit 71179ff5ab)
Fix https://github.com/scylladb/scylla-docs/issues/4126
Closes #11122
* github.com:scylladb/scylladb:
doc: add info about the time-consuming step due to resharding
doc: add the new KB to the toctree
doc: doc: add a KB about updating the mode in perftune.yaml after upgrade
(cherry picked from commit e9fec761a2)
Release 5.1 introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE sections should include links to that page and section.
Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.
Related: https://github.com/scylladb/scylladb/pull/9810
Closes #12070
* github.com:scylladb/scylladb:
doc: move the info about per-partition rate limit for the ALTER TABLE statemet from the paragraph to the list
doc: add the links to the per-partition rate limit extention to the CREATE TABLE and ALTER TABLE sections
(cherry picked from commit 6e9f739f19)
This is a backport of https://github.com/scylladb/scylladb/pull/11460.
Closes #12079
* github.com:scylladb/scylladb:
doc: update the commands to upgrade the ScyllaDB image
doc: fix the filename in the index to resolve the warnings and fix the link
doc: apply feedback by adding she step fo load the new repo and fixing the links
doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
doc: update the Enterprise guide to include the Enterprise-onlyimage file
doc: update the image files
doc: split the upgrade-image file to separate files for Open Source and Enterprise
doc: clarify the alternative upgrade procedures for the ScyllaDB image
doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z
This is a backport of https://github.com/scylladb/scylladb/pull/11108.
Closes #12063
* github.com:scylladb/scylladb:
doc: apply feedback about scylla-enterprise-machine-image
doc: update the note about installing scylla-enterprise-machine-image
update the info about installing scylla-enterprise-machine-image during upgrade
doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image
doc: update the info about metrics in 2022.1 compared to 5.0
doc: minor formatting and language fixes
doc: add the new guide to the toctree
doc: add the upgrade guide from 5.0 to 2022.1
PR #11577 added the 5.0->5.1 upgrade guide. At the same time, it
improved some of the common `.rst` files that are used in other
upgrade guides; e.g. the `docs/upgrade/_common/upgrade-guide-v4-rpm.rst`
file is used in the 4.6->5.0 upgrade guide.
The 5.0->5.1 upgrade guide was then refactored. The refactored version
was already backported to the 5.1 branch (#12034). But we should still
backport the improvements done in #11577. This commit contains these
improvements.
(cherry picked from commit 2513497f9a)
Closes #12055
This is a backport of https://github.com/scylladb/scylladb/pull/11461.
Closes #12044
* github.com:scylladb/scylladb:
doc: remove support for Debian 9 from versions 2022.1 and 2022.2
doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2
backport 11461 doc: add support for Debian 11 to versions 2022.1 and 2022.2
We added a UUID device file existence check in #11399. We expect the UUID
device file to be created before the check, and we wait for its creation with
"udevadm settle" after "mkfs.xfs".
However, we are actually getting an error saying the UUID device file is missing,
which probably means "udevadm settle" doesn't guarantee that the device file is
created under some conditions.
To avoid the error, use var-lib-scylla.mount to wait until the UUID device
file is ready, and run the file existence check when the service has
failed.
Fixes #11617
Closes #11666
(cherry picked from commit a938b009ca)
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for the newly formatted disk, so that we can
mount the disk just after formatting.
Fixes #11359
(cherry picked from commit 8835a34ab6)
When filtering with a multi-column restriction present, all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.
This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)
When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.
This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.
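The control-flow change can be sketched in isolation. This is a simplified Python model, not the code in selection.cc: `row_matches_old`, `row_matches_new`, and the representation of restrictions as plain predicates are assumptions made for illustration.

```python
def row_matches_old(row, multi_column_restriction, other_restrictions) -> bool:
    """Behaviour before the fix: the multi-column result is returned directly."""
    if multi_column_restriction is not None:
        return multi_column_restriction(row)  # other restrictions never run
    return all(r(row) for r in other_restrictions)


def row_matches_new(row, multi_column_restriction, other_restrictions) -> bool:
    """Behaviour after the fix: return early only when the check fails."""
    if multi_column_restriction is not None and not multi_column_restriction(row):
        return False
    return all(r(row) for r in other_restrictions)


# Predicates mirroring the example query above.
multi = lambda r: (r["ck1"], r["ck2"]) < (0, 0)
others = [lambda r: r["regular_col"] == 0]
row = {"pk": 0, "ck1": -1, "ck2": 5, "regular_col": 1}
```

For this row the multi-column restriction holds but `regular_col = 0` does not, so the old function accepts the row and the new one rejects it.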
Fixes: #6200
Fixes: #12014
Closes #12031
* github.com:scylladb/scylladb:
cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
boost/restrictions-test: uncomment part of the test that passes now
cql-pytest: enable test for filtering combined multi column and regular column restrictions
cql3: don't ignore other restrictions when a multi column restriction is present during filtering
(cherry picked from commit 2d2034ea28)
There were 4 different pages for upgrading Scylla 5.0 to 5.1 (and the
same is true for other version pairs, but I digress) for different
environments:
- "ScyllaDB Image for EC2, GCP, and Azure"
- Ubuntu
- Debian
- RHEL/CentOS
The Ubuntu and Debian pages used a common template:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
with different variable substitutions.
The "Image" page used a similar template, with some extra content in the
middle:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-image-opensource.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
The RHEL/CentOS page used a different template:
```
.. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst
```
This was an unmaintainable mess. Most of the content was "the same" for
each of these options. The only content that must actually be different
is the part with package installation instructions (e.g. calls to `yum`
vs `apt-get`). The rest of the content was logically the same - the
differences were mistakes, typos, and updates/fixes to the text that
were made in some of these docs but not others.
In this commit I prepare a single page that covers the upgrade and
rollback procedures for each of these options. The section dependent on
the system was implemented using Sphinx Tabs.
I also fixed and changed some parts:
- In the "Gracefully stop the node" section:
Ubuntu/Debian/Images pages had:
```rst
.. code:: sh
sudo service scylla-server stop
```
RHEL/CentOS pages had:
```rst
.. code:: sh
.. include:: /rst_include/scylla-commands-stop-index.rst
```
the stop-index file contained this:
```rst
.. tabs::
.. group-tab:: Supported OS
.. code-block:: shell
sudo systemctl stop scylla-server
.. group-tab:: Docker
.. code-block:: shell
docker exec -it some-scylla supervisorctl stop scylla
(without stopping *some-scylla* container)
```
So the RHEL/CentOS version had two tabs: one for Scylla installed
directly on the system, one for Scylla running in Docker - which is
interesting, because nothing anywhere else in the upgrade documents
mentions Docker. Furthermore, the RHEL/CentOS version used `systemctl`
while the ubuntu/debian/images version used `service` to stop/start
scylla-server. Both work on modern systems.
The Docker option is completely out of place - the rest of the upgrade
procedure does not mention Docker. So I decided it doesn't make sense to
include it. Docker documentation could be added later if we actually
decide to write upgrade documentation when using Docker... Between
`systemctl` and `service` I went with `service` as it's a bit
higher-level.
- Similar change for "Start the node" section, and corresponding
stop/start sections in the Rollback procedure.
- To reuse text for Ubuntu and Debian, when referencing "ScyllaDB deb
repo" in the Debian/Ubuntu tabs, I provide two separate links: to
Debian and Ubuntu repos.
- the link to rollback procedure in the RPM guide (in 'Download and
install the new release' section) pointed to rollback procedure from
3.0 to 3.1 guide... Fixed to point to the current page's rollback
procedure.
- in the rollback procedure steps summary, the RPM version missed the
"Restore system tables" step.
- in the rollback procedure, the repository links were pointing to the
new versions, while they should point to the old versions.
There are some other pre-existing problems I noticed that need fixing:
- EC2/GCP/Azure option has no corresponding coverage in the rollback
section (Download and install the old release) as it has in the
upgrade section. There is no guide for rolling back 3rd party and OS
packages, only Scylla. I left a TODO in a comment.
- the repository links assume certain Debian and Ubuntu versions (Debian
10 and Ubuntu 20), but there are more available options (e.g. Ubuntu
22). Not sure how to deal with this problem. Maybe a separate section
with links? Or just a generic link without choice of platform/version?
Closes #11891
(cherry picked from commit 0c7ff0d2cb)
Backport notes:
Funnily, the 5.1 branch did not have the upgrade guide to 5.1 at all. It
was only in `master`. So the backport does not remove files, only adds
new ones.
I also had to add:
- an additional link in the upgrade-opensource index to the 5.1 upgrade
page (it was already in upstream `master` when the cherry-picked commit
was added)
- the list of new metrics, which was also completely missing in
branch-5.1.
Closes #12034
Ubuntu 22.04 is supported by both ScyllaDB Open Source 5.0 and Enterprise 2022.1.
Closes #11227
* github.com:scylladb/scylladb:
doc: add the redirects from Ubuntu version specific to version generic pages
doc: remove version-specific content for Ubuntu and add the generic page to the toctree
doc: rename the file to include Ubuntu
doc: remove the version number from the document and add the link to Supported Versions
doc: add a generic page for Ubuntu
doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1
(cherry picked from commit d4c986e4fa)
This PR is related to https://github.com/scylladb/scylla-docs/issues/4124 and https://github.com/scylladb/scylla-docs/issues/4123.
**New Enterprise Upgrade Guide from 2021.1 to 2022.2**
I've added the upgrade guide for ScyllaDB Enterprise image. It consists of 3 files:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst
upgrade/_common/upgrade-image.rst
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst
**Modified Enterprise Upgrade Guides 2021.1 to 2022.2**
I've modified the existing guides for Ubuntu and Debian to use the same files as above, but exclude the image-related information:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst + /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst = /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian.rst
To make things simpler and remove duplication, I've replaced the guides for Ubuntu 18 and 20 with a generic Ubuntu guide.
**Modified Enterprise Upgrade Guides from 4.6 to 5.0**
These guides included a bug: they included the image-related information (about updating OS packages), because a file that includes that information was included by mistake. What's worse, it was duplicated. After the includes were removed, image-related information is no longer included in the Ubuntu and Debian guides (this fixes https://github.com/scylladb/scylla-docs/issues/4123).
I've modified the index file to be in sync with the updates.
Closes #11285
* github.com:scylladb/scylladb:
doc: reorganize the content to list the recommended way of upgrading the image first
doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file
doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information
doc: update the guides for Ubuntu and Debian to remove image information and the OS version number
doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1
(cherry picked from commit dca351c2a6)
Fix https://github.com/scylladb/scylladb/issues/11393
- Rename the tool names across the docs.
- Update the examples to replace `scylla-sstable` and `scylla-types` with `scylla sstable` and `scylla types`, respectively.
Closes #11432
* github.com:scylladb/scylladb:
doc: update the tool names in the toctree and reference pages
doc: rename the scylla-types tool as Scylla Types
doc: rename the scylla-sstable tool as Scylla SStable
(cherry picked from commit 2c46c24608)
This is a very important aspect of the tool that was completely missing from the document before. Also add a comparison with SStableDump.
Fixes: https://github.com/scylladb/scylladb/issues/11363
Closes #11390
* github.com:scylladb/scylladb:
docs: scylla-sstable.rst: add comparison with SStableDump
docs: scylla-sstable.rst: add section about providing the schema
(cherry picked from commit 2ab5cbd841)
The purpose of this PR is to update the information about the default SStable format.
It
Closes #11431
* github.com:scylladb/scylladb:
doc: simplify the information about default formats in different versions
doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format
doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page
doc: fix additional information regarding the ME format on the SStable 3.x page
doc: add the ME format to the table
add a comment to remove the information when the documentation is versioned (in 5.1)
doc: replace Scylla with ScyllaDB
doc: fix the formatting and language in the updated section
doc: fix the default SStable format
(cherry picked from commit a0392bc1eb)
This PR introduces the following changes to the documentation landing page:
- The " New to ScyllaDB? Start here!" box is added.
- The "Connect your application to Scylla" box is removed.
- Some wording has been improved.
- "Scylla" has been replaced with "ScyllaDB".
Closes #11896
* github.com:scylladb/scylladb:
Update docs/index.rst
doc: replace Scylla with ScyllaDB on the landing page
doc: improve the wording on the landing page
doc: add the link to the ScyllaDB Basics page to the documentation landing page
(cherry picked from commit 2b572d94f5)
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.
This patch changes the description to (I hope) better address these
issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11404
* github.com:scylladb/scylladb:
doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
doc: cql-extensions.md: improve description of synchronous views
(cherry picked from commit b9fc504fb2)
This PR is V2 of the [PR created by @psarna](https://github.com/scylladb/scylladb/pull/11560).
I have:
- copied the content.
- applied the suggestions left by @nyh.
- made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax.
Fixes https://github.com/scylladb/scylladb/issues/11378
Closes #11984
* github.com:scylladb/scylladb:
doc: label user-defined functions as Experimental
doc: restore the note for the Count function (removed by mistake)
doc: document user defined functions (UDFs)
(cherry picked from commit 7cbb0b98bb)
Fix https://github.com/scylladb/scylladb/issues/11373
- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.
Closes #11417
* github.com:scylladb/scylladb:
doc: add a comment to remove the note in version 5.1
doc: update the information on the Counting all rows page and add the recommendation to upgrade ScyllaDB
doc: add a note to the description of COUNT with a reference to the KB article
doc: add COUNT to the list of acceptable selectors of the SELECT statement
(cherry picked from commit 22bb35e2cb)
compaction_manager::task (and thus compaction_data) can be stopped
because of many different reasons. Thus, abort can be requested more
than once on compaction_data abort source causing a crash.
To prevent this before each request_abort() we check whether an abort
was requested before.
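The guard described above can be illustrated with a minimal sketch. The class `one_shot_abort_source` and the helper `stop_once` are hypothetical stand-ins invented for this example, not the actual compaction_data or abort source types; the assertion only models the crash on a repeated abort request.

```cpp
#include <cassert>

// Hypothetical stand-in for an abort source that must not be triggered twice.
class one_shot_abort_source {
    bool _requested = false;
public:
    bool abort_requested() const noexcept { return _requested; }

    void request_abort() {
        // Models the crash described above: a second request is a bug.
        assert(!_requested && "abort requested twice");
        _requested = true;
    }
};

// Guarded stop: safe to call from any number of independent stop paths.
inline void stop_once(one_shot_abort_source& as) {
    if (!as.abort_requested()) {
        as.request_abort();
    }
}
```

With the check in place, every stop path can call `stop_once()` without knowing whether another path already requested the abort.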
Closes #12004
(cherry picked from commit 7ead1a7857)
Fixes #12002.
The get_live_token_owners returns the nodes that are part of the ring
and live.
The get_unreachable_token_owners returns the nodes that are part of the ring
and are not alive.
The token_metadata::get_all_endpoints returns nodes that are part of the
ring.
The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and call is_alive to check if
the node is up or down. So that the correctness does not depend on
any derived information.
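As a rough illustration of the idea (one authoritative membership set, split only by a liveness check), consider the sketch below. The `endpoint` alias, the `split_token_owners` function and the `is_alive` callback are assumptions made for this example and do not mirror the real signatures.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>

using endpoint = std::string; // hypothetical stand-in for a node address

// Returns {live, unreachable}. Both sets are derived from the same ring
// membership set, so every ring member lands in exactly one of them.
inline std::pair<std::set<endpoint>, std::set<endpoint>>
split_token_owners(const std::set<endpoint>& ring_members,
                   const std::function<bool(const endpoint&)>& is_alive) {
    std::set<endpoint> live;
    std::set<endpoint> unreachable;
    for (const auto& ep : ring_members) {
        (is_alive(ep) ? live : unreachable).insert(ep);
    }
    assert(live.size() + unreachable.size() == ring_members.size());
    return {std::move(live), std::move(unreachable)};
}
```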
This patch fixes a truncate issue in storage_proxy::truncate_blocking
where it calls get_live_token_owners and get_unreachable_token_owners to
decide the nodes to talk with for truncate operation. The truncate
failed because incorrect nodes were returned.
Fixes #10296
Fixes #11928
Closes #11952
(cherry picked from commit 16bd9ec8b1)
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash. This wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentially caused a null dereference.
Tests: 1. The fixed issue scenario from github.
2. Unit tests in release mode.
Fixes #11774
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>
Closes #11777
(cherry picked from commit ab7429b77d)
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of these to be
resurrected and the view builder to generate view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.
Fixes: #11668
Closes #11671
(cherry picked from commit 5621cdd7f9)
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings of Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.
Fixes #11470
Closes #11693
(cherry picked from commit 636e14cc77)
The EC2 instance metadata service can be busy, so let's retry connecting at an
interval, just like we do in scylla-machine-image.
Fixes #10250
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #11688
(cherry picked from commit 6b246dc119)
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.
This series fixes this bug (it was a bug in our materialized views implementation) and adds a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).
Fixes #11801
Closes #11808
* github.com:scylladb/scylladb:
test/alternator: add test for issue 11801
MV: fix handling of view update which reassign the same key value
materialized views: inline used-once and confusing function, replace_entry()
(cherry picked from commit e981bd4f21)
When being stopped, the compaction manager may hit ENOSPC. This is not a
reason to fail the stopping process with an abort; it is better to log a
warning and proceed as if nothing happened.
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit a9805106 (table: seal_active_memtable: handle ENOSPC error)
made the memtable flushing code tolerate ENOSPC and continue flushing again
in the hope that the node administrator would provide some free space.
However, it looks like the IO code may report back ENOSPC with some
exception type this code doesn't expect. This patch tries to fix it.
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing loop is very branchy in its attempts to find out whether or
not to abort. The "allowed_retries" count can be a good indicator of the
decision taken. This makes the code notably shorter and easier to extend
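A retry loop driven by a single budget might look like the sketch below. It is not the actual flush code: `flush_result`, `flush_with_retries` and the way a fatal error zeroes the budget are assumptions chosen to show how one counter can replace several branches.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of a single flush attempt.
enum class flush_result { ok, no_space, fatal };

// Retries while the budget lasts. Returns true on success and false when
// the caller should give up (for example, abort).
inline bool flush_with_retries(const std::function<flush_result()>& try_flush,
                               int allowed_retries) {
    assert(allowed_retries >= 0);
    for (;;) {
        switch (try_flush()) {
        case flush_result::ok:
            return true;
        case flush_result::no_space:
            break;               // recoverable: spend one retry from the budget
        case flush_result::fatal:
            allowed_retries = 0; // unrecoverable: drop the whole budget
            break;
        }
        if (allowed_retries-- <= 0) {
            return false;
        }
    }
}
```

The decision to stop is then encoded entirely in `allowed_retries`, so adding a new error class only means choosing how it adjusts the budget.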
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.
Fixes #11245
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11246
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
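A minimal sketch of such an early check follows. Only the 6.71e-5 threshold comes from the text above; the function name `validate_bloom_filter_fp_chance`, the constant name and the error message are hypothetical and do not reflect the actual validation code.

```cpp
#include <stdexcept>
#include <string>

// Minimal supported false-positive rate quoted above.
constexpr double min_supported_bloom_filter_fp_chance = 6.71e-5;

// Hypothetical validation hook, meant to run when table options are parsed,
// i.e. long before any memtable is flushed.
inline void validate_bloom_filter_fp_chance(double fp_chance) {
    if (fp_chance < min_supported_bloom_filter_fp_chance) {
        throw std::invalid_argument(
            "bloom_filter_fp_chance must be at least 6.71e-5, got " +
            std::to_string(fp_chance));
    }
}
```

Rejecting the value at this point turns a crash during flush into an ordinary error returned to the client that issued the statement.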
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes #11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11576
(cherry picked from commit 4c93a694b7)
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.
The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.
So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.
Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.
Fixes #11222
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11298
(cherry picked from commit 941c719a23)
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:
    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subsequent stop
will queue its stopping lambdas after barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.
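The lifetime argument can be shown with plain standard C++, without Seastar. In the sketch below, `task_queue`, `shared_barrier_state` and `barrier` are hypothetical types invented for illustration; the queue merely models tasks that run after the owner may already be gone.

```cpp
#include <deque>
#include <functional>
#include <memory>

// Hypothetical stand-in for an smp queue: tasks run later, possibly after
// the object that queued them has been destroyed.
using task_queue = std::deque<std::function<void()>>;

struct shared_barrier_state {
    bool aborted = false;
    int wakeups = 0;
};

class barrier {
    std::shared_ptr<shared_barrier_state> _state =
        std::make_shared<shared_barrier_state>();
public:
    void abort(task_queue& q) {
        _state->aborted = true;
        // Capture the shared_ptr by value instead of `this`: the queued task
        // keeps the state alive even if the barrier is destroyed first.
        q.push_back([state = _state] { ++state->wakeups; });
    }

    std::shared_ptr<const shared_barrier_state> state() const { return _state; }
};
```

Capturing `this` in the lambda would compile just as well, which is why this kind of bug tends to show up only when the execution order is shuffled.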
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11553
The generator was first setting the marker then applied tombstones.
The marker was set like this:
row.marker() = random_row_marker();
Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.
However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.
This could generate rows with markers uncompacted with shadowable tombstones.
This broke row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.
Fix by making sure there are no key clashes.
Closes #11663
(cherry picked from commit 5268f0f837)
If user stops off-strategy via API, compaction manager can decide
to give up on it completely, so data will sit unreshaped in
maintenance set, preventing it from being compacted with data
in the main set. That's problematic because it will probably lead
to a significant increase in read and space amplification until
off-strategy is triggered again, which cannot happen anytime
soon.
Let's handle it by moving data in maintenance set into main one,
even if unreshaped. Then regular compaction will be able to
continue from where off-strategy left off.
Fixes #11543.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11545
(cherry picked from commit a04047f390)
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
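The growth policy can be sketched as follows. This example uses std::vector rather than chunked_vector to stay self-contained, and `reserve_for_push` is a hypothetical helper name; it shows only the geometric reservation, not the undo log itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Guarantees room for one more entry, so that the following push_back()
// cannot throw on allocation. Capacity grows geometrically, which makes
// N reservations cost O(N) in total instead of O(N^2).
template <typename T>
void reserve_for_push(std::vector<T>& log) {
    if (log.size() == log.capacity()) {
        const std::size_t cap = log.capacity();
        log.reserve(cap == 0 ? 1 : cap * 2);
    }
    assert(log.capacity() > log.size());
}
```

Calling `reserve_for_push(log)` before each `log.push_back(...)` keeps the exception-safety property of the original code while reallocating only a logarithmic number of times.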
Fixes #11211
Closes #11215
(cherry picked from commit 7f80602b01)
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421
Closes #11422
(cherry picked from commit be9d1c4df4)
Per-partition rate limiting added a new error type which should be
returned when Scylla decides to reject an operation due to per-partition
rate limit being exceeded. The new error code requires drivers to
negotiate support for it, otherwise Scylla will report the error as
`Config_error`. The existing error code override logic works properly,
however due to a mistake Scylla will report the `Config_error` code even
if the driver correctly negotiated support for it.
This commit fixes the problem by specifying the correct error code in
`rate_limit_exception`'s constructor.
Tested manually with a modified version of the Rust driver which
negotiates support for the new error. Additionally, tested what happens
when the driver doesn't negotiate support (Scylla properly falls back to
`Config_error`).
Branches: 5.1
Fixes: #11517
Closes #11518
(cherry picked from commit e69b44a60f)
Commit 8ab57aa added a yield to the buffer-copy loop, which means that
the copy can yield before it is done. The multishard reader might then see
the half-copied buffer and consider the reader done (because
`_end_of_stream` is already set), resulting in the remaining part of the
buffer being dropped and in an invalid stream if the last copied fragment
wasn't a partition-end.
Fixes: #11561
(cherry picked from commit 0c450c9d4c)
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.
fixes: #11465
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2c74062962)
from Tomasz Grabiec
This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.
No known production impact.
Refs https://github.com/scylladb/scylladb/issues/11307
Closes #11312
* github.com:scylladb/scylladb:
test: mutation_test: Add explicit test for mutation commutativity
test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
db: mutation_partition: Drop unnecessary maybe_shadow()
db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
mutation_partition: row: make row marker shadowing symmetric
(cherry picked from commit 484004e766)
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.
Closes #11260
(cherry picked from commit 8ee5b69f80)
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes #11508
Closes #11536
(cherry picked from commit 9b6fc553b4)
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes #11202
(cherry picked from commit cdb3e71045)
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes #11549
Closes #11555
(cherry picked from commit d3b8c0c8a6)
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.
Fixes #11476
(cherry picked from commit 1c2eef384d)
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264
Closes #11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
(cherry picked from commit afa7960926)
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes #11239
Closes #11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
Implementing json2sstable functionality. It allows generating an sstable from a JSON description of its content. Uses identical schema to dump-data, so it is possible to regenerate an existing sstable, by feeding the output of dump-data to write.
Most of the scylla storage engine features are supported. The only non-supported features are counters and non-strictly atomic data types (including frozen collections, tuples and UDTs).
Example invocation:
```
scylla sstable write --system-schema system_schema.columns --input-file ./input.json --generation 0
```
Refs: https://github.com/scylladb/scylladb/issues/9681
Future plans:
* Complete support for remaining features (counters and non-atomic types).
* Make sstable format configurable on the command line.
Closes #11181
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add test for sstable write
test/cql-pytest: test-tools.py actually test with multiple sstables
test/cql-pytest: test_tools.py: reduce the number of test-cases
tools/scylla-sstable: introduce the write operation
tools/scylla-sstable: add support for writer operations
tools/scylla-sstable: dump-data: write bound-weight as int
tools/scylla-sstable: dump-data: always write deletion time for cell tombstones
tools/scylla-sstable: dump-data: add timezone to deletion_time
types: publish timestamp_from_string()
Some segments of code using wasmtime were not under an
ifdef SCYLLA_ENABLE_WASMTIME, making Scylla unable to compile
on machines without wasmtime. This patch adds the ifdef where
needed.
Closes #11200
We can now do a full circle: dump an sstable to json, generate an
sstable from it, then dump again and compare to the original json.
Expand the existing simple_no_clustering_table and
simple_clustering_table schema/data to improve coverage of things like
TTL, tombstones and static rows.
The test-cases in this suite have a parameter to run with one or
multiple input sstables. This was broken as each test table generated a
single sstable. Fix this so we actually get single/multiple input
sstable coverage.
Currently this test-case exercises all the available component dumpers
with many different schemas. This doesn't add any value for most of the
dumpers, save for the dump-data one. It does have a cost however in
run-time of these test-cases. Test the dumpers which are mostly
indifferent to the schema with just a single one, cutting the number of
generated test-cases from 70 to 30.
Allows generating an sstable based on a JSON description of its content.
Uses identical schema to dump-data, so it is possible to regenerate an
existing sstable, by feeding the output of dump-data to write.
Most of the scylladb storage engine features are supported, with the
exception of the following:
* counters
* non-strictly atomic types, including frozen collections, tuples or
UDTs.
Currently it is assumed that all operations read sstables. They get a
non-empty list of sstables as input and have no means to create
sstable-writers.
We want to add support for operations that write sstables. For this, we
relax the current top-level check about the sstable list not being
empty. We defer this empty-check for operations that actually need input
sstables. Furthermore, the operation_func gains an sstable_manager&
argument, to allow operations to create sstable writers.
Operations are now read-write capable.
In addition to the above the documentation language is adjusted to not
assume read-only operations.
Deletion time is always in UTC, but whoever looks at the JSON has no way
to know that. In particular, date-time parsers assume the local timezone in
its absence, which of course results in an incorrect deletion_time after
parsing.
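To illustrate what an unambiguous timestamp looks like, the sketch below formats seconds since the epoch in UTC with an explicit "Z" suffix. The function name `format_utc` and the exact ISO 8601 layout are assumptions for this example; the format actually emitted by dump-data may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Formats seconds-since-epoch as an ISO 8601 timestamp in UTC. The trailing
// "Z" tells parsers not to assume the local timezone.
// Note: std::gmtime() uses a shared static buffer and is not thread-safe.
inline std::string format_utc(std::time_t t) {
    const std::tm* tm = std::gmtime(&t);
    assert(tm != nullptr);
    char buf[32];
    const std::size_t n = std::strftime(buf, sizeof(buf), "%Y-%m-%dT%H:%M:%SZ", tm);
    assert(n > 0);
    return std::string(buf, n);
}
```

For example, `format_utc(0)` yields `1970-01-01T00:00:00Z`.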
Even in an environment where initializing Scylla fails with an error,
"scylla --version" should be able to run without error.
To do so, we need to parse and execute these options before
initializing Scylla/Seastar classes.
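The ordering can be sketched with a plain argument scan that needs no runtime initialization at all. The helper `wants_version` is hypothetical and much simpler than real option parsing; it only shows that the check can run before anything that might fail.

```cpp
#include <cstring>

// Returns true if any command-line argument is exactly "--version".
// It touches nothing but argv, so it is safe to call first thing in main(),
// before constructing any class whose initialization might fail.
inline bool wants_version(int argc, const char* const* argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--version") == 0) {
            return true;
        }
    }
    return false;
}
```

A caller would print the version string and return immediately when this function reports true, and only otherwise proceed to the regular startup sequence.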
Fixes #11117
Closes #11179
Start compaction_manager as a sharded service
and pass a reference to it to the database rather
than having the database construct its own compaction_manager.
This is part of the wider scope effort to decouple compaction from replica database and table.
Closes #11099
* github.com:scylladb/scylladb:
compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges
compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges
main: start compaction_manager as a sharded service
compaction_manager: keep config as member
backlog_controller: keep scheduling_group by value
backlog_controller: scheduling_group: keep io_priority_class by value
backlog_controller: scheduling_group: define default member initializers
backlog_controller: get rid of _interval member
token_metadata: impl: keep the set of normal token owners as a member
We don't need to recalculate the unique set of normal token owners
every time we change `_token_to_endpoint_map`.
Similarly, this doesn't have to be done in `get_all_endpoints`.
Instead we can maintain it inexpensively in
`remove_endpoint`, and let `count_normal_token_owners`
just return its size and `get_all_endpoints` just return
the saved set.
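The bookkeeping can be sketched as below. `token_ring_sketch`, its `add_token` method and the simplified `token`/`endpoint` aliases are hypothetical; the real token_metadata implementation has more state and different signatures. The sketch also assumes tokens are never reassigned between owners, which keeps the cached set trivially consistent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <unordered_set>

using token = std::int64_t;    // simplified stand-in
using endpoint = std::string;  // simplified stand-in

class token_ring_sketch {
    std::map<token, endpoint> _token_to_endpoint;
    std::unordered_set<endpoint> _normal_token_owners; // maintained incrementally
public:
    void add_token(token t, const endpoint& ep) {
        auto [it, inserted] = _token_to_endpoint.emplace(t, ep);
        // Sketch-only restriction: a token is never moved to another owner.
        assert(inserted || it->second == ep);
        _normal_token_owners.insert(ep);
    }

    void remove_endpoint(const endpoint& ep) {
        for (auto it = _token_to_endpoint.begin(); it != _token_to_endpoint.end();) {
            it = (it->second == ep) ? _token_to_endpoint.erase(it) : std::next(it);
        }
        _normal_token_owners.erase(ep);
    }

    // Both accessors read the saved set; nothing is recomputed.
    std::size_t count_normal_token_owners() const { return _normal_token_owners.size(); }
    const std::unordered_set<endpoint>& get_all_endpoints() const { return _normal_token_owners; }
};
```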
Closes #11128
Fixes #11146
Closes #11158
* github.com:scylladb/scylladb:
token_metadata: allow update_normal_token_owners to yield
token_metadata: get_all_endpoints: return const unordered_set<inet_address>&
token_metadata: impl: keep the set of normal token owners as a member
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed, but that is
pointless: if the partition has no more data, we just drop the header, as
we won't need it on the next page.
The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.
Fixes: https://github.com/scylladb/scylladb/issues/9482
Closes #11175
* github.com:scylladb/scylladb:
test/cql-pytest: add regression test for "IDL frame truncated" error
mutation_compactor: detach_state(): make it no-op if partition was exhausted
querier: use full_position in shard_mutation_querier
When the last non-dummy row is evicted from a partition, the partition
entry is evicted as well. The existing logic in on_evicted() leaves
the last dummy row in the partition version before evicting the
partition entry. This row may still be attached to the LRU. Eviction
of partition entry goes through mutation_cleaner::clear_gently(). If
this is preempted, the destruction may proceed in the background. If
eviction happens on the remaining row in that entry before it's
destroyed, the code will hit undefined behavior. on_evicted() calls
partition_version::is_referenced_from_entry(), which is unspecified
when the version is enqueued in the mutation_cleaner. It returns
an incorrect value for the last item remaining in the LRU (middle entries evict fine).
In that case, eviction will try to access non-existent containing partition_entry,
causing undefined behavior.
Caught by debug-mode cql_query_test.test_clustering_filtering with
raft enabled, where it manifested like this:
partition_version.hh:328:16: runtime error: load of value 7, which is not a valid value for type 'bool'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior partition_version.hh:328:16 in
Aborting on shard 0.
Instances of this issue outside of the unit test environment are not
known as of yet.
This change makes is_referenced_from_entry() return the correct value
even for versions which are queued in the mutation cleaner.
Fixes https://github.com/scylladb/scylladb/issues/11140
The series also contains some related cleanups and minor fixes for issues which
could come up later.
Closes #11187
* github.com:scylladb/scylladb:
cache_tracker: Make clear() leave no garbage
partition_snapshot_row_cursor: Fix over-counting of rows
row_cache: Fix undefined behavior during eviction under some conditions
Calling WebAssembly UDFs requires a wasmtime instance. Creating such an instance is expensive,
but these instances can be reused for subsequent calls of the same UDF on various inputs.
This patch introduces a way of reusing wasmtime instances: a wasm instance cache.
The cache stores a wasmtime instance for each UDF and scheduling group. The instances are
evicted using LRU strategy and their size is based on the size of their wasm memories.
The instances stored in the cache are also dropped when the UDF is dropped itself. For that reason,
the first patch modifies the current implementation of UDF dropping, so that the instance dropping may be added
later. The patch also removes the need of compiling the UDF again when dropping it.
The second patch contains the implementation and use of the new cache. The cache is implemented
in `lang/wasm_instance_cache.hh` and the main ways of using it are the `run_script` methods from `wasm.hh`.
The third patch adds tests to `test_wasm.py` that check the correctness and performance of the new
cache. The tests confirm the instance reuse, size limits, instance eviction after timeout and after dropping the UDF.
Closes #10306
* github.com:scylladb/scylladb:
wasm: test instances reuse
wasm: reuse UDF instances
schema_tables: simplify merge_functions and avoid extra compilation
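To illustrate the caching scheme described above, here is a small sketch of a size-bounded LRU cache keyed by UDF and scheduling group. It is not the code from `lang/wasm_instance_cache.hh`: the `instance` struct, the integer scheduling group id, and every method name are assumptions, and timeout-based eviction is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a wasmtime instance; only its memory size matters here.
struct instance {
    std::size_t memory_size;
};

// Key: (UDF name, scheduling group id). Both types are simplifications.
using cache_key = std::pair<std::string, int>;

class instance_cache {
    struct entry {
        cache_key key;
        instance inst;
    };
    std::list<entry> _lru; // front = most recently used
    std::map<cache_key, std::list<entry>::iterator> _index;
    std::size_t _total_size = 0;
    std::size_t _max_size;

    void erase(std::list<entry>::iterator it) {
        _total_size -= it->inst.memory_size;
        _index.erase(it->key);
        _lru.erase(it);
    }

public:
    explicit instance_cache(std::size_t max_size) : _max_size(max_size) {}

    // Returns the cached instance, if any, and marks it most recently used.
    std::optional<instance> get(const cache_key& key) {
        auto found = _index.find(key);
        if (found == _index.end()) {
            return std::nullopt;
        }
        _lru.splice(_lru.begin(), _lru, found->second);
        return found->second->inst;
    }

    void put(const cache_key& key, instance inst) {
        if (auto found = _index.find(key); found != _index.end()) {
            erase(found->second);
        }
        _lru.push_front(entry{key, inst});
        _index[key] = _lru.begin();
        _total_size += inst.memory_size;
        // Evict least recently used entries until the size budget is met.
        // An instance larger than the whole budget is evicted immediately.
        while (_total_size > _max_size && !_lru.empty()) {
            erase(std::prev(_lru.end()));
        }
    }

    // Drop every instance of a UDF, e.g. when the UDF itself is dropped.
    void remove_udf(const std::string& name) {
        for (auto it = _lru.begin(); it != _lru.end();) {
            auto next = std::next(it);
            if (it->key.first == name) {
                erase(it);
            }
            it = next;
        }
    }

    std::size_t size() const { return _total_size; }
};
```

Charging each entry by its memory size rather than counting entries mirrors the statement that the cache size is based on the size of the wasm memories.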
In this PR, I have:
- removed the docs for Manager (including the sources for Manager 2.1 and the upgrade guides).
- added redirects to https://manager.docs.scylladb.com/.
- replaced the internal links with external links to https://manager.docs.scylladb.com/.
Closes #11162
* github.com:scylladb/scylladb:
doc: update the link to fix the warning about duplicate targets
Update docs/kb/gc-grace-seconds.rst
Update docs/_utils/redirects.yaml
doc: update the links to Manager
doc: add the link to manager.docs.scylladb.com to the toctree
doc: remove the docs for Manager - the Manager page, the guide for Manager 2.1, Manager upgrade guides
doc: add redirections from Manager 2.1 to the Manager docs
doc: add redirections to manager.docs.scylladb.com
insert_before() may need to allocate memory for a btree, so may
fail. Call cache_tracker::insert() only after successful insertion so
that row counters reflect the correct state. On failure, the entry
will be unlinked automatically by rows_entry destructor, but row
counters in the cache_tracker will not be automatically decremented.
Given #11146, we see a 10ms stall when calculate_natural_endpoints
calls get_all_endpoints that up until this patch performed a
similar loop on the `_token_to_endpoint_map`, so to prevent such
a stall with large number of tokens, turn update_normal_token_owners
async, and allow yielding in the per-token tight loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't need to recalculate the unique set of normal token owners
every time we change `_token_to_endpoint_map`.
Similarly, this doesn't have to be done in `get_all_endpoints`.
Instead we can maintain it inexpensively in
`remove_endpoint`, and let `count_normal_token_owners`
just return its size and `get_all_endpoints` just return
the saved set.
Note that currently topology is not updated accurately
in update_normal_token() and it may contain endpoints
that no longer own any tokens.
If we did update topology accurately there, we
could use its locations map instead as its keys are equivalent
to the unordered_set<inet_address> we implement here.
Closes #11128
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently they are copied for the get_sstables function
so this change reduces copies.
Also, it will allow further decoupling of compaction_manager
from replica::database, by letting the caller of
perform_cleanup and perform_sstable_upgrade get the
owned token ranges from db and pass it to the perform_*
functions in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to keep a mutable reference to the
scheduling_group passed at construction time since
setting / updating shares is using the scheduling_group /
io_priority_class id as a handle, and the id itself is never
changed by the backlog_controller.
Note that the class names are misleading; in hindsight,
they would better be called scheduling_group_id
and io_priority_class_id, respectively.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Exactly like the cpu scheduling_group, io_priority_class
contains the class id, which is a handle to the io_priority_class
and so can be kept by value, rather than by reference,
and be safely copied around.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prepare for the next patch, implement default initialization
of the scheduling_group and io_priority_class, to the default values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.
This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
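As an illustration only, the new contract can be sketched with heavily simplified types. The `detached_state` struct, the string-based fragments, and the `mark_partition_exhausted()` method are hypothetical stand-ins and do not reflect the real compactor interface:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified representation of the fragments to re-feed.
struct detached_state {
    std::string partition_start;                       // partition-header
    std::optional<std::string> static_row;             // present only if any
    std::optional<std::string> active_range_tombstone; // present only if active
};

class compactor {
    std::optional<std::string> _current_partition;
    bool _partition_exhausted = false;

public:
    void consume_partition_start(std::string key) {
        _current_partition = std::move(key);
        _partition_exhausted = false;
    }

    // Called when the current partition has no more data to compact.
    void mark_partition_exhausted() {
        assert(_current_partition.has_value());
        _partition_exhausted = true;
    }

    // Returns a disengaged optional when there is nothing worth resuming,
    // so the caller has no fragments to push back into the reader.
    std::optional<detached_state> detach_state() const {
        if (!_current_partition || _partition_exhausted) {
            return std::nullopt;
        }
        return detached_state{*_current_partition, std::nullopt, std::nullopt};
    }
};
```

With this shape, the caller pushes fragments back only when the optional is engaged, so a header is never re-emitted for an exhausted partition.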
Instead of a separate partition key and position-in-partition.
This continues the recently started effort to standardize storing of
full positions on `full_position`.
This patch is also a hidden preparation for read_context::save_readers()
(multishard_mutation_query.cc) no longer being able to get partition key
from compaction state in the future.
Fix https://github.com/scylladb/scylla-docs/issues/4125
I've added the upgrade guides from 5.0 to 2022.1. They are based on the previous upgrade guides from Open Source to Enterprise.
Closes #11108
* github.com:scylladb/scylladb:
doc: apply feedback about scylla-enterprise-machine-image
doc: update the note about installing scylla-enterprise-machine-image
update the info about installing scylla-enterprise-machine-image during upgrade
doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image
doc: update the info about metrics in 2022.1 compared to 5.0
doc: minor formatting and language fixes
doc: add the new guide to the toctree
doc: add the upgrade guide from 5.0 to 2022.1
I created this branch to remove the external docs (Manager, Monitoring, Operator) from the core ScyllaDB documentation.
However, to make reviewing easier, this PR only covers removing the docs for ScyllaDB Monitoring Stack. I'm going to send other PRs to cover Manager and Operator.
In this PR, I have:
- removed the docs for ScyllaDB Monitoring Stack (including the sources for old versions).
- added redirects to https://monitoring.docs.scylladb.com/.
- replaced the internal links with external links to https://monitoring.docs.scylladb.com/.
Closes #11151
* github.com:scylladb/scylladb:
doc: fix the link to the Monitoring Stack
doc: fix the links in the manager section
doc: add the external link to Monitoring Stack to the menu
doc: replace the links to Monitoring Stack
doc: add the redirections for Monitoring Stack
doc: delete the Monitoring Stack documentation from the ScyllaDB docs and remove it from the toctree
When the last non-dummy row is evicted from a partition, the partition
entry is evicted as well. The existing logic in on_evicted() leaves
the last dummy row in the partition version before evicting the
partition entry. This row may still be attached to the LRU. Eviction
of partition entry goes through mutation_cleaner::clear_gently(). If
this is preempted, the destruction may proceed in the background. If
eviction happens on the remaining row in that entry before it's
destroyed, the code will hit undefined behavior. on_evicted() calls
partition_version::is_referenced_from_entry(), which is unspecified
when the version is enqueued in the mutation_cleaner. It returns
an incorrect value for the last item remaining in the LRU. In that case
eviction will try to access non-existent containing partition_entry,
causing undefined behavior.
Caught by debug-mode cql_query_test.test_clustering_filtering with
raft enabled, where it manifested like this:
partition_version.hh:328:16: runtime error: load of value 7, which is not a valid value for type 'bool'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior partition_version.hh:328:16 in
Aborting on shard 0.
Instances of this issue outside of the unit test environment are not
known as of yet.
This change makes is_referenced_from_entry() return the correct value
even for versions which are queued in the mutation cleaner.
Fixes #11140
Today, mutation_reader_merger drops unneeded readers in batches of 4,
meaning that the merger is having to keep the memory used by 3
unneeded readers in addition to the ones being currently read from.
As each may own a lot of memory, the combined effect of this waste,
coming from parallel reads, can potentially cause memory pressure.
This batching behavior was introduced in b524f96a74,
when readers had to be destroyed synchronously, as flat_mutation_reader
lacked an async close interface. But we have gone a long way since
then. Readers can be closed asynchronously and outstanding I/O
requests will be cancelled on close.
Now, we'll close readers as soon as they're unneeded, one at a time,
using a continuation chain. If we're submitting close calls faster
than we can retire them, then we wait for their completion,
preventing memory usage from growing unbounded.
The benefit of this new approach will be very good when combining
disjoint readers, where only one is active at a time for producing
fragments. As soon as we're done with the current one, then it will
be closed allowing its memory to be released, before we move on
to the next reader that follows.
Refs #11040.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11167
The forward service uses a vector of ranges owned by a particular
shard in order to split and delegate the work. The number can
grow large though, which can cause large allocations.
This commit limits the number of ranges handled at a time to 256.
Fixes #10725
Closes #11182
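The batching idea can be sketched as follows, assuming the ranges sit in a plain vector and are processed synchronously. The function name `for_each_batch` and its callback shape are hypothetical; the real forward service is asynchronous and this sketch does not model that:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hands the ranges to `handle_batch` in chunks of at most `max_batch`
// elements, so no single step has to hold one large allocation.
// Returns the number of batches that were processed.
template <typename Range, typename Func>
std::size_t for_each_batch(const std::vector<Range>& ranges, std::size_t max_batch, Func&& handle_batch) {
    assert(max_batch > 0);
    std::size_t batches = 0;
    for (std::size_t begin = 0; begin < ranges.size(); begin += max_batch) {
        const std::size_t end = std::min(begin + max_batch, ranges.size());
        std::vector<Range> batch(ranges.begin() + begin, ranges.begin() + end);
        handle_batch(batch);
        ++batches;
    }
    return batches;
}
```

Calling it with `max_batch` set to 256 corresponds to the limit named in this commit.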
and return status over the rest api' from Aleksandra Martyniuk
Currently, scrub returns to the user a number indicating the operation
result, as follows:
- 1 when the operation was aborted;
- 3 in validate and segregate modes when validation errors were found
(and in segregate mode - fixed);
- 0 if operation ended successfully.
To achieve this, if an operation was aborted in abort mode, then
the exception is propagated to storage_service.cc. Also the number
of validation errors for current scrub is gathered and summed
from each shard there.
The number of validation errors is counted and registered in metrics.
Metrics provide common counters for all scrub operations within
a compaction manager, though. Thus, to check the exact number
of validation errors, the comparison of counter value before and after
scrub operation needs to be done.
Closes #11074
* github.com:scylladb/scylladb:
scrub compaction: return status indicating aborted operations over the rest api
test: move scylla_inject_error from alternator/ to cql-pytest/
scrub compaction: count validation errors and return status over the rest api
scrub compaction: count validation errors for specific scrub task
compaction: extract statistics in compaction_result
scrub compaction: register validation errors in metrics
scrub compaction: count validation errors
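The status mapping described in this series can be summarized with the sketch below. The enum and function names are hypothetical, and letting an abort take precedence over validation errors is an assumption that the message does not state:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Status values as listed in the message: 0, 1 and 3.
enum class scrub_status : int {
    successful = 0,
    aborted = 1,
    validation_errors = 3,
};

// Sums the per-shard validation error counts and derives the status
// reported to the user.
scrub_status scrub_result(bool aborted, const std::vector<std::uint64_t>& errors_per_shard) {
    assert(!errors_per_shard.empty()); // there is always at least one shard
    if (aborted) {
        return scrub_status::aborted; // assumption: abort wins over errors
    }
    const std::uint64_t total =
        std::accumulate(errors_per_shard.begin(), errors_per_shard.end(), std::uint64_t{0});
    return total > 0 ? scrub_status::validation_errors : scrub_status::successful;
}
```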
"
Commit 829b4c14 (repair: Make removenode safe by default) turned these
two to be read only (in fact, erase- and clear- from too).
"
* 'br-dangling-replicating-nodes' of https://github.com/xemul/scylla:
storage_service: Relax confirm_replication()
storage_service: Remove _removing_node
storage_service: Remove _replicating_nodes
This PR removes the existing Operator documentation pages from the core ScyllaDB docs. I have:
- removed the Operator page and replaced it with the link to the Operator documentation.
- created a redirect.
- updated the links to the Operator.
Closes #11154
* github.com:scylladb/scylladb:
Update docs/operating-scylla/index.rst
doc: fix the link to Operator
add the redirect to the Operator
replace the internal links with the external link to Operator
add the external link to Operator to the toctree
doc: remove the Operator docs from the core documentation
The goal is to put .svg files under git grep's radar. Otherwise a
pretty innocent 'git grep db::is_local' dumps the contents of the
docs/kb/flamegraph.svg on the screen, because it a) contains the
grep pattern and b) is looooong one-liner
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220730090026.8537-1-xemul@scylladb.com>
This reverts commit c3bad157e5, reversing
changes made to e66809d051. The checks it
adds are triggered by some dtests. While it's possible the check is
triggered due to an existing problem, better to investigate it out-of-tree.
Fixes #11169.
This method is called from REPLICATION_FINISHED handler and now just
logs a message. The verb is probably worth keeping for compatibility
at least for some time. The logging itself can be moved into handler's
lambda
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
over the rest api
When performing a compaction scrub, the user did not know whether the
operation was aborted.
If a compaction scrub is aborted, the return status the user gets over
the rest api is set to 1.
Move scylla_inject_error from alternator/ to cql-pytest/ so it
can be reached from various tests dirs. alternator/util.py is
renamed to alternator/alternator_util.py to avoid name shadowing.
When performing a compaction scrub, the user did not know whether any
validation errors were encountered.
The number of validation errors per given compaction scrub is gathered
and summed from each shard. Based on that value, the return status over
the rest api is set to 3 if any validation errors were encountered.
The number of validation errors is registered in metrics. Metrics
provide common counters for all scrub operations within a compaction
manager, though. Thus, to check the exact number of validation errors,
the counters need to be compared before and after the scrub
operation.
Currently, if token_metadata_impl::update_normal_tokens
throws an exception before it's done, it leaves the
token_metadata_impl members partially updated
and we have no way of recovering from that.
The existing use cases take that into account
and always call it on a cloned, temporary copy of the token
metadata, so if it throws, the temporary copy is tossed away
without being applied back.
So just cement this, by adding cautions in the token_metadata
class declaration.
Closes #11127
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220728144821.130518-1-bhalevy@scylladb.com>
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed, but that is
pointless: if the partition has no more data, we just drop the header, as
we won't need it on the next page.
The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.
Fixes: https://github.com/scylladb/scylla/issues/9482
Closes #11137
* github.com:scylladb/scylladb:
test/cql-pytest: add regression test for "IDL frame truncated" error
query: query_result_builder: add check for missing partition-end
mutation_compactor: detach_state(): make it no-op if partition was exhausted
querier: use full_position in shard_mutation_querier
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if
the bucket to compact is inflated, having too many
sstables. In that case we don't want to add fuel
to the fire by creating yet another sstable.
Fixes #4116
Closes #10954
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
compaction_manager: add maybe_wait_for_sstable_count_reduction
time_window_compaction_strategy: get_sstables_for_compaction: clean up code
time_window_compaction_strategy: make get_sstables_for_compaction idempotent
time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
leveled_manifest: pass compaction_counter as const&
Currently logalloc::region is relying on boost binomial_heap handle to properly move listeners registration when the region (when derived from dirty_memory_manager_logalloc::size_tracked_region) is moved, like boost::intrusive link hooks do -
hence 81e20ceaab/dirty_memory_manager.cc (L89-L90) does nothing.
Unfortunately, this doesn't work as expected.
This series adds a unit test that verifies the move semantics
and a fix to size_tracked_region and region_group code to make it pass.
Also, "logalloc: region: get_impl might be called on disengaged _impl when moved"
fixes a couple of corner cases where the shared _impl could be dereferenced when disengaged, and
the change adds a unit test for that too.
Closes #11141
* github.com:scylladb/scylla:
logalloc: region: properly track listeners when moved
logalloc: region_impl: add moved method
logalloc: region: merge: optimize getting other impl
logalloc: region: merge: call region_impl::unlisten
logalloc: region: call unlisten rather than open coding it
logalloc: region move-ctor: initialize _impl
logalloc: region: get_impl might be called on disengaged _impl when moved
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if
the bucket to compact is inflated, having too many
sstables. In that case we don't want to add fuel
to the fire by creating yet another sstable.
Fixes #4116
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To make sure fully_expired sstables are not missed
if get_sstables_for_compaction is called just heuristically,
change the state by setting _last_expired_check
to the current time only when no fully_expired_sstables are found
among the candidates.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Print the compaction_strategy `this` pointer
so we can distinguish between different instances of the
compaction_strategy object (some code paths copy it and
some may instantiate a brand new compaction_strategy object).
The motivation is detecting when the side effects of this function are
applied on the "master" instance, stored in the table shard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The purpose of this PR is to update the README file in the `docs` folder to:
- Explain the contents of the folder (user docs vs developer docs).
- Add more information to help contributors.
- Remove outdated information.
Closes #11134
* github.com:scylladb/scylla:
docs: remove outdated information -Vale support, Lint, warning about livereload
doc: improve the section about knowledge base articles in README
doc: replace distribution names with a generic phrase: Linux distributions
doc: remove irrelevant guidelines for contributors from README
doc: language improvements in the doc's README
doc: reorganize the content in the doc's README
doc: update the Prerequisites section in the doc's README
doc: remove redundant information from README in the docs folder
doc: add key information to the introduction in README in the docs folder
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.
Consider a simplified mutation reversal scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):
schema_ptr s = schema_builder{"ks", "cf"}
.with_column("pk", bytes_type, column_kind::partition_key)
.with_column("ck1", bytes_type, column_kind::clustering_key)
.build();
Input range tombstone positions:
{clustered, ckp{}, before}
{clustered, ckp{1}, after}
Clustering rows:
{clustered, ckp{2}, equal}
{clustered, ckp{}, after} // dummy row
During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). The read order in the example above is:
Reversed range tombstone positions:
1: {clustered, ckp{}, before}
2: {clustered, ckp{1}, before}
Clustering rows read backwards:
3: {clustered, ckp{}, after} // dummy row
4: {clustered, ckp{2}, equal}
Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
as greater than each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point we notice it's
dummy and should be skipped. Subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).
The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.
Fixes: https://github.com/scylladb/scylla/issues/11147
Closes #11129
* github.com:scylladb/scylla:
mutation: Add test if mutations are consumed in order
test: Move validating_consumer to test/lib/mutation_assertions.hh
mutation: Ignore dummy rows when consuming clustering fragments
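The effect of moving the dummy-row check to the start of the loop can be shown with a toy merge. Positions are plain integers that already follow the reversed order, and the `row` struct and `merge_fragments` function are hypothetical simplifications of consume_clustering_fragments:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy clustering row: an integer position plus a dummy flag.
struct row {
    int position;
    bool dummy;
};

// Merges sorted range tombstone positions with rows and returns the
// positions in the order in which they are emitted.
std::vector<int> merge_fragments(const std::vector<int>& tombstones, const std::vector<row>& rows) {
    assert(std::is_sorted(tombstones.begin(), tombstones.end()));
    std::vector<int> out;
    std::size_t t = 0;
    std::size_t r = 0;
    while (t < tombstones.size() || r < rows.size()) {
        // Skip dummy rows first, before they take part in any comparison.
        if (r < rows.size() && rows[r].dummy) {
            ++r;
            continue;
        }
        if (r == rows.size() || (t < tombstones.size() && tombstones[t] <= rows[r].position)) {
            out.push_back(tombstones[t++]);
        } else {
            out.push_back(rows[r++].position);
        }
    }
    return out;
}
```

Modelling the example above with tombstones at 0 and 2, a dummy row at 100 and a real row at 1, the merge emits 0, 1, 2. Had the dummy been compared before being skipped, both tombstones would have been emitted ahead of the real row.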
Don't open-code calling the region_impl
_listeners->moved() in region move-constructor
and move-assignment op.
The other._impl->_region might be different than &other
post region::merge so let the region_impl
decide which region* is moved from.
The new_region is also set to region_impl->_region
so no need to open-code that either in the said call sites.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The other _impl is presumed to be engaged already,
so just call other.get_impl() once for both use cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We can't be sure that the other_impl->_region == &other
since it could be a result of a previous merge,
so don't decide for it which region to unlisten to,
let it use its current _region.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Current ~region and region::operator= open-code
region_impl::unlisten. Just call it so it will be
easier to maintain.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.
Add unit test for these odd corner cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the reader feeding the result builder is missing a partition-end
between two partitions, or at end-of-stream, the result builder will
write a corrupt partition-entry into the result, ending up in an
"IDL Frame truncated" error.
It is trivial to add a check for this, and this will result in a much
clearer error message than the mysterious frame truncated error
mentioned above.
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.
This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
Instead of a separate partition key and position-in-partition.
This continues the recently started effort to standardize storing of
full positions on `full_position`.
This patch is also a hidden preparation for read_context::save_readers()
(multishard_mutation_query.cc) no longer being able to get partition key
from compaction state in the future.
Broken since the v2 output support was introduced (ad435dc).
No known adverse effects, besides mutation reads stopping a little later
than desired (on the next non-range-tombstone-change fragment) and hence
consuming more memory than the limit set for them.
Fixes: #11138
Closes #11139
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).
Closes #11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
The series unifies memtable flush error handling into table::seal_active_memtable
following up on f6d9d6175f.
The goal here is to prevent an infinite retry loop as in #10498
by aborting on any error that is not bad_alloc.
Fixes #10498
Closes #10691
* github.com:scylladb/scylla:
test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
test: memtable_test: failed_flush_prevents_writes: extend error injection
table: seal_active_memtable: abort if retried for too long
table: seal_active_memtable: abort on unexpected error
table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
dirty_memory_manager: flush_permit: add has_sstable_write_permit
dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
dirty_memory_manager: flush_permit: make _sstable_write_permit optional
table: reindent seal_active_memtable
table: coroutinize seal_active_memtable
memtable_list: mark functions noexcept
commitlog: make discard_completed_segments and friends noexcept
dirty_memory_manager: flush_when_needed: target error handling at flush_one
database: delete unused seal_delayed_fn_type
dirty_memory_manager: mark functions noexcept
memtable: mark functions noexcept
memtable: memtable_encoding_stats_collector: mark functions noexcept
encoding_state: mark functions noexcept
logalloc: mark free functions noexcept
logalloc: allocating_section: mark functions noexcept
logalloc: allocating_section: guard: mark constructor noexcept
logalloc: reclaim_lock: mark functions noexcept
logalloc: tracker_reclaimer_lock: mark constructor noexcept
logalloc: mark shard_tracker noexcept
logalloc: region: mark functions const/noexcept
logalloc: basic_region_impl: mark functions noexcept
logalloc: region_impl: mark functions noexcept
utils: log_heap: mark functions noexcept
logalloc: region_impl: object_descriptor: mark functions noexcept
logalloc: region_group: mark functions noexcept
logalloc: tracker: mark functions const/noexcept
logalloc: tracker::impl: make region_occupancy and friends const
logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
logalloc: tracker::impl: mark functions noexcept
logalloc: segment: mark functions const / noexcept
logalloc: segment_pool: add const variant of descriptor method
logalloc: segment_pool: move descriptor method to class definition
logalloc: segment_pool: mark functions const/noexcept
logalloc: segment_pool: delete unused free_or_restore_to_reserve method
utils: dynamic_bitset: mark functions noexcept
utils: dynamic_bitset: delete unused members
logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
logalloc: segment_store: make can_allocate_more_segments const
logalloc: segment_store: mark functions noexcept
logalloc: segment_descriptor: mark functions noexcept
logalloc: occupancy_stats: mark functions noexcept
min_max_tracker: mark functions noexcept
gc_clock, db_clock: mark functions noexcept
dirty_memory_manager: region_group: mark functions noexcept
dirty_memory_manager: region_group: make simple constructor noexcept
dirty_memory_manager: region_group_reclaimer mark functions noexcept
logalloc: lsa_buffer: mark functions noexcept
This patch reduces the number of metrics that is reported per table, when
the per-table flag is on.
When possible, it moves from time_estimated_histogram and
timed_rate_moving_average_and_histogram to use the unified timer.
Instead of a histogram per shard, it will now report a summary per shard
and a histogram per node.
Counters, histograms, and summaries will not be reported if they were
never used.
The API was updated accordingly so it would not break.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
There is a set of per-table metrics that should only be registered for
user tables.
As time passes, more keyspaces are added that are not user keyspaces,
and there is now a function that covers all those cases.
This patch replaces the implementation to use is_internal_keyspace.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The to_metrics_summary is a helper function that creates a metrics type
summary from a timed_rate_moving_average_with_summary object.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently, there are two metrics reporting mechanisms: the metrics layer
and the API. In most cases, they use the same data sources. The main
difference is around histograms and rate.
The API calculates an exponentially weighted moving average using a
timer that decays the average on each time tick. It calculates a
poor-man histogram by holding the last few entries (typically the last
256 entries). The caller to the API uses those last entries to build a
histogram.
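As a rough illustration of the decaying average described above, the
sketch below folds the events of each tick into a running rate. The class
name, the `alpha` smoothing factor, and the one-second default tick are
assumptions made for this example; it is not the API's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of an exponentially weighted moving average of a rate.
class ewma_rate_sketch {
    double _alpha;          // smoothing factor in (0, 1]; assumed, chosen by the caller
    double _rate = 0.0;     // current average, in events per second
    uint64_t _pending = 0;  // events counted since the last tick
public:
    explicit ewma_rate_sketch(double alpha) noexcept : _alpha(alpha) {
        assert(alpha > 0.0 && alpha <= 1.0);
    }
    // Record n events; cheap enough for a hot path.
    void mark(uint64_t n = 1) noexcept { _pending += n; }
    // Called by a timer once per tick: moves the average towards the rate
    // observed during the last tick, so old activity decays over time.
    void tick(double tick_seconds = 1.0) noexcept {
        const double instant = static_cast<double>(_pending) / tick_seconds;
        _rate += _alpha * (instant - _rate);
        _pending = 0;
    }
    double rate() const noexcept { return _rate; }
};
```

With `alpha = 0.5`, marking 10 events and ticking once gives a rate of 5;
a second tick with no events decays it to 2.5.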
We want to add summaries to Scylla. Similar to the API rate and
histogram, summaries are calculated per time interval.
This patch creates a unified mechanism by introducing an object that
would hold both the old-style histogram and the new
(estimated_histogram). On each time tick, a summary would be calculated.
In the future, we'll replace the API to report summaries instead of the
old-style histogram and deprecate the old style completely.
summary_calculator uses two estimated_histogram to calculate a summary.
timed_rate_moving_average_summary_and_histogram is a unified class for
ihistogram, rates, summary, and estimated_histogram and will replace
timed_rate_moving_average_and_histogram.
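One plausible way to derive a per-interval summary from two histograms is
to subtract the previous cumulative bucket counts from the current ones
and read a quantile off the difference. Whether summary_calculator works
exactly like this is an assumption; the function name and the bucket
layout below are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: estimate quantile q for the last interval from two
// cumulative bucket snapshots. bounds[i] is the upper bound of bucket i.
uint64_t interval_quantile_sketch(const std::vector<uint64_t>& bounds,
                                  const std::vector<uint64_t>& previous,
                                  const std::vector<uint64_t>& current,
                                  double q) {
    assert(bounds.size() == previous.size() && bounds.size() == current.size());
    assert(q > 0.0 && q <= 1.0);
    // Number of samples recorded during the interval.
    uint64_t total = 0;
    for (size_t i = 0; i < bounds.size(); ++i) {
        total += current[i] - previous[i];
    }
    if (total == 0) {
        return 0; // nothing was recorded, so there is nothing to summarize
    }
    // Rank of the sample that sits at quantile q.
    const auto target = static_cast<uint64_t>(std::ceil(q * static_cast<double>(total)));
    uint64_t seen = 0;
    for (size_t i = 0; i < bounds.size(); ++i) {
        seen += current[i] - previous[i];
        if (seen >= target) {
            return bounds[i]; // upper bound of the bucket holding that sample
        }
    }
    return bounds.back();
}
```

The result is only as precise as the bucket boundaries, which is the usual
trade-off of summarizing from an estimated histogram.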
Follow-up patches would move code from using
timed_rate_moving_average_and_histogram to
timed_rate_moving_average_summary_and_histogram. By keeping the API it
would make the transition easy.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This series refactors the code to get rid of unnecessary
allocations by extracting a helper requires_thread() function,
as well as by removing std::optional usage in forward_result,
now that it's possible to merge empty results with each other,
both ways (#11064).
Closes #11120
* github.com:scylladb/scylla:
forward_service: remove redundant optional from forward_service
forward_service: open-code running a Seastar thread
forward_service: add requires_thread helper
=== Setup ===
1) start node1 with
```
scylla --num-tokens 20000 --smp 1
```
The large number of tokens per node is used to simulate large number of nodes in the cluster (large total number of tokens for the cluster).
2) start node2 with
```
scylla --num-tokens 20000 --smp 1
```
3) Measure the time to finish bootstrap
=== Result ===
1) With speed up patch:
```
node1 (16s)
INFO 2022-06-21 14:30:00,038 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ...
INFO 2022-06-21 14:30:16,019 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed.
node2 (bootstrap node,174s)
INFO 2022-06-21 14:30:40,954 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ...
INFO 2022-06-21 14:33:34,899 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed.
```
2) Without speed up patch:
```
node1 (171s)
INFO 2022-06-21 14:38:49,065 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ...
INFO 2022-06-21 14:41:40,601 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed.
node2 (bootstrap node, 1181s)
INFO 2022-06-21 14:41:46,997 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ...
INFO 2022-06-21 15:01:27,507 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed.
```
The improvements for bootstrap time:
node1: 171s / 16s = 10.68X
node2: 1181s / 174s = 6.78X
Refs #10337
Refs #10817
Refs #10836
Refs #10837
Closes #10850
* github.com:scylladb/scylla:
locator: Speed up abstract_replication_strategy::get_address_ranges
locator: Speed up simple_strategy::calculate_natural_endpoint
token_metadata: Speed up count_normal_token_owners
We know that sstable_run is supposed to contain disjoint files only,
but this assumption can temporarily break when switching strategies
as TWCS, for example, can incorrectly pick the same run id for
sstables in different windows during segregation. So when switching
from TWCS to ICS, it could happen that an sstable_run won't contain disjoint
files. We should definitely fix TWCS and any other strategy doing
that, but sstable_run should enforce disjointness as an actual invariant
rather than relax it. Otherwise, we cannot build readers on this
assumption, and more complicated logic has to be added to merge
overlapping files.
After this patch, sstable_run will reject insertion of a file that
will cause the invariant to break, so caller will have to check
that and push that file into a different sstable run.
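The rejecting insert can be pictured with the sketch below. It models
each file as an inclusive key range of integers; the class name and this
representation are assumptions for illustration only and do not mirror
the real sstable_run interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical sketch of a run that only accepts non-overlapping files.
class sstable_run_sketch {
    std::map<int64_t, int64_t> _files; // first key -> last key (inclusive)
public:
    // Returns false, leaving the run unchanged, if [first, last] overlaps
    // any file already in the run.
    bool insert(int64_t first, int64_t last) {
        assert(first <= last);
        // Closest file starting at or after `first`.
        auto next = _files.lower_bound(first);
        if (next != _files.end() && next->first <= last) {
            return false;
        }
        // Closest file starting before `first`.
        if (next != _files.begin()) {
            auto prev = std::prev(next);
            if (prev->second >= first) {
                return false;
            }
        }
        _files.emplace(first, last);
        return true;
    }
    size_t size() const noexcept { return _files.size(); }
};
```

A caller that gets `false` back would place the file into a different
run, as described above.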
Closes #11116
Now that memtable flush error handling was moved entirely
to table::seal_active_memtable, we don't need to notify_soft_pressure
to keep retry going. The infinite retry loop should
eventually either succeed or die (by isolating the node or aborting)
on its own.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If we haven't been able to flush the memtable
in ~30 minutes (based on the number of retries)
just abort assuming that the OOM
condition is permanent rather than transient.
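Combined with the rule from this series that only bad_alloc is worth
retrying, the decision could look roughly like the sketch below. The
function name, the enum, and the `max_retries` parameter are hypothetical;
the real retry count and backoff are not given here.

```cpp
#include <cassert>
#include <exception>
#include <new>
#include <stdexcept>

enum class flush_action { retry, abort };

// Hypothetical sketch: decide what to do after a failed memtable flush.
flush_action on_flush_error_sketch(std::exception_ptr error,
                                   unsigned retries_so_far,
                                   unsigned max_retries) noexcept {
    assert(error); // rethrowing a null exception_ptr is undefined behavior
    try {
        std::rethrow_exception(error);
    } catch (const std::bad_alloc&) {
        // Out of memory may be transient, so retry, but not forever.
        return retries_so_far < max_retries ? flush_action::retry
                                            : flush_action::abort;
    } catch (...) {
        // Any other error is unexpected; retrying would only loop.
        return flush_action::abort;
    }
}
```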
Refs #4344
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently when we can't write the flushed sstable
due to corruption in the memtable we get into
an infinite retry loop (see #10498).
Until we can go into maintenance mode, the next best thing
would be to abort, though there is still a risk that
commitlog replay will reproduce the corruption in the
memtable and we'd end up with an infinite crash loop.
(hence #10498 is not Fixed with this patch)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And let seal_active_memtable decide about how to handle them
as now all flush error handling logic is implemented there.
In particular, unlike today, sstable write errors will
cause internal error rather than loop forever.
Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.
Refs #10498
(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.
Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.
Refs #4174
Refs #10498
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can safely test whether it was released or not
by release_sstable_write_permit in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that everything prior to flush_one is noexcept
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().
The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It was added in d20fae96a2
as a precaution not to invalidate iterators while
traversing _regions. However it is not required as no allocation
is done on this synchronous path - therefore there is no
point in preventing reclaim.
This will allow making the respective functions const
as they merely return stats and do not modify the tracker impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To make the implementation inline and to prepare
for the next patch that adds a const overload of
this method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some methods were also marked inline, both when declared in the class
definition and at their definition site, to provide a hint to
the compiler to inline them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
dynamic_bitset allocates only when constructed.
From then on it doesn't throw.
Note, though, that accessing bits out of range
is undefined behavior.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Maintain the const chain by returning a const segment*
from segment_from_idx() const overload.
And add a respective mutable overload to return a mutable segment*.
This is done for a similar change in idx_from_segment.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a const noexcept overload of `find_empty()` so that
can_allocate_more_segments can be const noexcept as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.
Consider a simplified mutation reversal scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):
schema_ptr s = schema_builder{"ks", "cf"}
.with_column("pk", bytes_type, column_kind::partition_key)
.with_column("ck1", bytes_type, column_kind::clustering_key)
.build();
Range tombstones:
range_tombstone rt1{ckp{}, bound_kind::incl_start, ckp{1}, bound_kind::incl_end, tombstone{ts + 0, tp}};
range_tombstone rt2{ckp{1}, bound_kind::excl_start, ckp{}, bound_kind::incl_end, tombstone{ts + 1, tp}};
Input range tombstone positions:
{clustered, ckp{}, before}
{clustered, ckp{1}, after}
Clustering rows:
{clustered, ckp{2}, equal}
{clustered, ckp{}, after} // dummy row
During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). Position of rows is not
reversed, as regular rows always have equal positions (which does not
hold for dummy rows, which causes the problem in this case).
The read order in the example above is:
Reversed range tombstone positions:
1: {clustered, ckp{}, before}
2: {clustered, ckp{1}, before}
Clustering rows read backwards:
3: {clustered, ckp{}, after} // dummy row
4: {clustered, ckp{2}, equal}
Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
to be gt each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point we notice it's
dummy and should be skipped. Subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).
The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.
logalloc manages regions of log-structured allocated memory, and region_groups
containing such regions and other region_groups. region_groups were introduced
for accounting purposes - first to limit the amount of memory in memtables, then to
match new dirty memory allocation rate with memtable flushing rate so we never
hit a situation where allocation rate exceeded flush rate, and we exceed our limit.
The problem is that the abstraction is very weak - if we want to change anything
in memtable flush control we'll need to change region_groups too - and also
expensive to maintain.
The solution is to break the abstraction and move region_groups to memtable
dirty memory management code. Instead introduce a new, simpler abstraction,
the region_listener, which communicates changes in region memory consumption
to an external piece of code, which can then choose to do with it what it likes.
The long term plan is to completely remove region_groups and fold them into dirty_memory_manager:
- make each memtable a region_listener so it gets called back after size changes
- make memtables inform their dirty_memory_manager about the size so dirty_memory_manager can decide to throttle writes and which memtable to pick to flush
Closes #10839
* github.com:scylladb/scylla:
logalloc: drop region_impl public accessors
logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc
logalloc: relax lifetime rules around region_listener
logalloc, dirty_memory_manager: move region_group and associated code
logalloc: expose tracker_reclaimer_lock
logalloc: reimplement tracker_reclaim_lock to avoid using hidden classes
logalloc: reduce friendship between region and region_group
logalloc: decouple region_group from region
memtable: stop using logalloc::region::group() to test for flushed memtables
This patch splits the timed_rate_moving_average functionality into two, a
data class: rates_moving_average, and a wrapper class
timed_rate_moving_average that uses a timer to update the rates
periodically.
To make the transition as simple as possible, timed_rate_moving_average
keeps the original API.
A new helper class meter_timer was introduced to handle the timer update
functionality.
This change required minimal code adaptation in some other parts of the
code.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch fixes a bug in should_sample that uses its bitmask
incorrectly.
basic_ihistogram has a feature that allows it to sample values instead
of taking a timer each time.
To decide if it should sample or not, it uses a bitmask. The bitmask
is of the form 2^n-1, which means 1 out of 2^n will be sampled.
For example, if the mask is 0x1 (2^1-1) 1 out of 2 will be sampled.
If the mask is 0x7 (2^3-1) 1 out of 8 will be sampled.
There was a bug in the should_sample() method.
The correct form is (value&mask) == mask
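The check can be tried in isolation with the sketch below. The function
names are hypothetical stand-ins, not the basic_ihistogram members; only
the `(value & mask) == mask` form is taken from this message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the sampling decision. `mask` must be of the
// form 2^n - 1, so that exactly 1 out of 2^n consecutive values matches.
inline bool should_sample_sketch(uint64_t value, uint64_t mask) noexcept {
    assert((mask & (mask + 1)) == 0); // mask is 2^n - 1
    return (value & mask) == mask;
}

// Counts how many of the values 0 .. total-1 would be sampled.
inline uint64_t count_sampled_sketch(uint64_t total, uint64_t mask) noexcept {
    uint64_t sampled = 0;
    for (uint64_t v = 0; v < total; ++v) {
        if (should_sample_sketch(v, mask)) {
            ++sampled;
        }
    }
    return sampled;
}
```

Over 16 consecutive values, mask 0x1 samples 8 of them and mask 0x7
samples 2, matching the 1-out-of-2 and 1-out-of-8 ratios above.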
Ref #2747
It does not solve all of #2747, just the bug part of it.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This pull request introduces a "synchronous mode" for global views. In this mode, all view updates are applied synchronously as if the view was local.
Marking view as a synchronous one can be done using `CREATE MATERIALIZED VIEW` and `ALTER MATERIALIZED VIEW`. E.g.:
```cql
ALTER MATERIALIZED VIEW ks.v WITH synchronous_updates = true;
```
Marking view as a synchronous one was done using tags (originally used by alternator). No big modifications in the view's code were needed.
Fixes: https://github.com/scylladb/scylla/issues/10545
Closes #11013
* github.com:scylladb/scylla:
cql-pytest: extend synchronous mv test with new cases
cql-pytest: allow extra parameters in new_materialized_view
docs: add a paragraph on view synchronous updates
test/boost/cql_query_test: add test setting synchronous updates property
test: cql-pytest: add a test for synchronous mode materialized views
db: view: react to synchronous updates tag
cql3: statements: cf_prop_defs: apply synchronous updates tag
alternator, db: move the tag code to db/tags
cql3: statements: add a synchronous_updates property
To get the list of tokens for a given node, we loop through all the
tokens and calculate the nodes that are responsible for the token.
In case of the everywhere_topology, we know any node that is part of
the ring will be responsible for all tokens.
This patch adds a fast path for everywhere_topology to avoid calculating
natural endpoints.
Refs #10337
Refs #10817
Refs #10836
Refs #10837
If the number of nodes in the cluster is smaller than the desired
replication factor, we should return from the loop as soon as endpoints
already contains all the nodes in the cluster, because no more nodes can be
added to the endpoints list.
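A minimal sketch of the early exit, assuming the ring is given as the
owner node of each token in ring order. The function and struct names
are hypothetical, and the real calculate_natural_endpoint works on
token_metadata rather than a plain vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct walk_result_sketch {
    std::vector<int> endpoints; // distinct owners, in the order found
    size_t tokens_visited;      // how far around the ring we had to walk
};

// Hypothetical sketch: walk the ring from `start`, collecting distinct
// owners, and stop once min(rf, node_count) endpoints have been found.
walk_result_sketch natural_endpoints_sketch(const std::vector<int>& ring_owners,
                                            size_t start, size_t rf,
                                            size_t node_count) {
    assert(ring_owners.empty() || start < ring_owners.size());
    // With fewer nodes than rf, the endpoint list can never reach rf
    // entries, so cap the target at the number of nodes.
    const size_t wanted = std::min(rf, node_count);
    walk_result_sketch result{{}, 0};
    for (size_t i = 0; i < ring_owners.size() && result.endpoints.size() < wanted; ++i) {
        const int owner = ring_owners[(start + i) % ring_owners.size()];
        ++result.tokens_visited;
        if (std::find(result.endpoints.begin(), result.endpoints.end(), owner)
                == result.endpoints.end()) {
            result.endpoints.push_back(owner);
        }
    }
    return result;
}
```

With two nodes, six tokens and rf = 3, the walk stops after two tokens
instead of scanning the whole ring looking for a third node.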
Refs #10337
Refs #10817
Refs #10836
Refs #10837
Currently, a set of nodes is built from _token_to_endpoint_map to get
the number of nodes in _token_to_endpoint_map.
To make it faster so we can call it on a fast path in the following
patch, a _nr_normal_token_owners member is introduced to track the
number.
Refs #10337
Refs #10817
Refs #10836
Refs #10837
This commit refactors the code to get rid of unnecessary
std::optional usage in forward_result, since now it's possible
to merge empty results with each other, both ways (#11064).
Fix mixing of log filename and log summary in error reporting for
CQLApprovalTest and PythonTest.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #11125
With the region heap handle removed from logalloc::region, there is
nothing remaining there that needs violation of the abstraction
boundary, so we can drop these hacks.
The region_group mechanism used an intrusive heap handle embedded in
logalloc::region to allow region_group:s to track the largest region. But
with region_group moved out of logalloc, the handle is out of place.
Move it out, introducing a new intermediate class size_tracked_region
to hold the heap handle. We might eventually merge the new class into
memtable (which derives from it), but that requires a large rearrangement
of unit tests, so defer that.
Currently, a region_listener is added during construction and removed
during destruction. This was done to mimic the old region(region_group&)
constructor, as region_listener replaces region_group.
However, this makes moving the binomial heap handle outside logalloc
difficult. The natural place for the handle is in a derived class
of logalloc::region (e.g. memtable), but members of this derived class
will be destroyed earlier than the logalloc::region here. We could play
tricks with an earlier base class but it's better to just decouple
region lifecycle from listener lifecycle.
Do that by adding listen()/unlisten() methods. Some small awkwardness
remains in that merge() implicitly unlistens (see comment in
region::unlisten).
Unit tests are adjusted.
region_group is an abstraction that allows accounting for groups of
regions, but the cost/benefit ratio of maintaining the abstraction
is poor. Each time we need to change the decision algorithm of memtable
flushing (admittedly rarely), we need to distill that into an abstraction
for region_groups and then use it. An example is virtual region groups;
we wanted to account for the partially flushed memtables and had to
invent region groups to stand in their place.
Rather than continuing to invest in the abstraction, break it now
and move it to the memtable dirty memory manager which is responsible
for making those decisions. The relevant code is moved to
dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and
a new unit test file is added as well.
A downside of the change is that unit testing will be more difficult.
Right now tracker_reclaim_lock uses tracker::impl::reclaiming_lock,
which won't be visible if we want to expose tracker_reclaim_lock and
use it from another translation unit. However, it's simple to switch
to an implementation that doesn't require an unknown-size data member,
and instead increment a counter via a pointer, so do that.
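The pointer-to-counter approach can be sketched as a small RAII guard.
The class and helper names are hypothetical; in particular, where the
counter lives and how reclaim consults it are assumptions of this
example.

```cpp
#include <cassert>

// Hypothetical sketch: a guard that only stores a pointer to a counter,
// so its size does not depend on any hidden implementation class.
class reclaim_lock_sketch {
    unsigned* _depth; // counter owned elsewhere (e.g. by the tracker)
public:
    explicit reclaim_lock_sketch(unsigned& depth) noexcept : _depth(&depth) {
        ++*_depth;
    }
    ~reclaim_lock_sketch() {
        assert(*_depth > 0);
        --*_depth;
    }
    reclaim_lock_sketch(const reclaim_lock_sketch&) = delete;
    reclaim_lock_sketch& operator=(const reclaim_lock_sketch&) = delete;
};

// Reclaim is allowed only while no guard is alive.
inline bool reclaiming_enabled_sketch(unsigned depth) noexcept {
    return depth == 0;
}
```

Because the guard counts rather than toggles a flag, nested guards work:
reclaim stays disabled until the outermost guard is destroyed.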
- add conversions between region and region_impl
- add accessor for the binomial heap handle
- add accessor for region_impl::id()
- remove friend declarations
This helps in moving region_group to a different source file, where
the definitions of region_impl will not be visible.
As a first step in moving region_group away from logalloc, decouple
communications between region and region_group. We introduce region_listener,
that listens for the events that region passed directly to region_group.
A region_group now installs a region_listener in a region, instead of
having region know about the region_group directly.
This decoupling is still leaky:
- merge() chooses to forget the merged-from region's region_listener.
This happens to be suitable for the only user of merge().
- We're still embedding the binomial heap handle, used by region_group
to keep track of region sizes, in regions. A complete decoupling would
transfer that responsibility to region_group.
Currently, the memtable reader uses logalloc::region::group() to test
for whether a memtable has been flushed. If a memtable doesn't belong
to a region group (from dirty_memory_manager), it is flushed.
This is quite tortuous - logalloc::region::merge() makes the merged-from
region identical to the merged-to region. The merged-to region, the cache,
doesn't have a group, so the check works.
Since we're making region groups part of dirty_memory_manager, the cache
will no longer have this indirect way of communication with memtable. But
instead we can use a direct callback it already has -
on_detach_from_region_group(). Use that to set a flag, and examine it in
the read path.
Previous interface forced the caller to allocate forward_aggregates
in order to be able to conditionally run the merging code inside
a Seastar thread, which is suboptimal. By open-coding the condition,
it's possible to drop the do_with, saving an allocation.
Prevent stalls in this path as seen in performance testing.
Also, add a respective rest_api test.
Fixes #11114
Closes #11115
* github.com:scylladb/scylla:
storage_service: reserve space in get_range_to_address_map and friends
storage_service: coroutinize get_range_to_address_map and friends
storage_service: pass replication map to get_range_to_address_map and friends
storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer
test: rest_api: test range_to_endpoint_map and describe_ring
Merging empty results was already allowed, but in one way only:
empty.merge(nonempty, r); // was permitted
nonempty.merge(empty, r); // not permitted
With this commit, both methods are permitted.
In order to remove copying, the other result is now taken
by rvalue reference, with all call sites being updated
accordingly.
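A simplified picture of the symmetric merge, assuming a result is just a
vector of partial aggregates that are combined by addition. The struct is
hypothetical and omits the reducer argument (`r` above) that the real
merge takes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a result that can be merged with an empty one
// in either direction.
struct forward_result_sketch {
    // One partial aggregate per selected column; empty means "no result yet".
    std::vector<int64_t> values;

    // Takes `other` by rvalue reference so its storage can be reused
    // instead of copied.
    void merge(forward_result_sketch&& other) {
        if (other.values.empty()) {
            return; // nonempty.merge(empty): nothing to add
        }
        if (values.empty()) {
            values = std::move(other.values); // empty.merge(nonempty): steal
            return;
        }
        assert(values.size() == other.values.size());
        for (size_t i = 0; i < values.size(); ++i) {
            values[i] += other.values[i];
        }
    }
};
```

Since an empty result is a valid operand on both sides, callers no longer
need to wrap the accumulator in std::optional just to represent "nothing
merged yet".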
Fixes #10446
Fixes #10174
Closes #11064
* round up reported time to microseconds
* add backtrace if stall detected
* add call site name (hierarchical when timers are nested)
* put timers in more places
* reduce possible logspam in nested timers by making sure to report on things only once and to not report on durations smaller than those already reported on
Closes #10576
* github.com:scylladb/scylla:
utils: logalloc: fix indentation
utils: logalloc: split the reclaim_timer in compact_and_evict_locked()
utils: logalloc: report segment stats if reclaim_segments() times out
utils: logalloc: reclaim_timer: add optional extra log callback
utils: logalloc: reclaim_timer: report non-decreasing durations
utils: logalloc: have reclaim_timer print reserve limits
utils: logalloc: move reclaim timer destructor for more readability
utils: logalloc: define a proper bundle type for reclaim_timer stats
utils: logalloc: add arithmetic operations to segment_pool::stats
utils: logalloc: have reclaim timers detect being nested
utils: logalloc: add more reclaim_timers
utils: logalloc: move reclaim_timer to compact_and_evict_locked
utils: logalloc: pull reclaim_timer definition forward
utils: logalloc: reclaim_timer make tracker optional
utils: logalloc: reclaim_timer: print backtrace if stall detected
utils: logalloc: reclaim_timer: get call site name
utils: logalloc: reclaim_timer: rename set_result
utils: logalloc: reclaim_timer: rename _reserve_segments member
utils: logalloc: reclaim_timer round up microseconds
And add calls to maybe_yield to prevent stalls in this path
as seen in performance testing.
Also, add a respective rest_api test.
Fixes #11114
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A series of refactors to the `raft_group0` service.
Read the commits in topological order for best experience.
This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through.
Closes #11024
* github.com:scylladb/scylla:
service: storage_service: additional assertions and comments
service/raft: raft_group0: additional logging, assertions, comments
service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0`
service/raft: raft_group0: rewrite `remove_from_group0`
service/raft: raft_group0: rewrite `leave_group0`
service/raft: raft_group0: split `leave_group0` from `remove_from_group0`
service/raft: raft_group0: introduce `setup_group0`
service/raft: raft_group0: introduce `load_my_addr`
service/raft: raft_group0: make some calls abortable
service/raft: raft_group0: remove some temporary variables
service/raft: raft_group0: refactor `do_discover_group0`.
service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`
service/raft: raft_group0: extract `start_server_for_group0` function
service/raft: raft_group0: create a private section
service/raft: discovery: `seeds` may contain `self`
Before they are made asynchronous in the next patch,
so they work on a coherent snapshot of the token_metadata and
replication map as their caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We could yield between updating the list of servers in raft/fsm
and updating the raft_address_map, e.g. in case of a set_configuration.
If tick_leader happens before the raft_address_map is updated,
is_alive will be called with server_id that is not in the map yet.
Fix: scylladb/scylla-dtest#2753
Closes #11111
It is only needed for the "storage_service/describe_ring" api
and service/storage_service shouldn't bother with it.
It's an api sugar coating.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the WHERE clause grammar is constrained to a conjunction of
relations: `WHERE a = ? AND b = ? AND c > ?`. The restriction happens in three
places:
1. the grammar will refuse to parse anything else
2. our filtering code isn't prepared for generic expressions
3. the interface between the grammar and the rest of the cql3 layer is via a vector of terms rather than an expression
While most of the work will be in extending the filtering code, this series tackles the
interface; it changes the `whereClause` production to return an expression rather than
a vector. Since much of cql3 layer is interested in terms, a new boolean_factors() function
is introduced to convert an expression to its boolean terms.
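The idea behind boolean_factors() can be sketched on a toy expression
type. Everything below (the struct, the helper constructors, relations
stored as strings) is a hypothetical simplification; the real cql3
expression type is far richer.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy expression: either a single relation or a conjunction.
struct expr_sketch {
    bool is_conjunction = false;
    std::string relation;              // used when !is_conjunction
    std::vector<expr_sketch> children; // used when is_conjunction
};

expr_sketch relation(std::string text) {
    expr_sketch e;
    e.relation = std::move(text);
    return e;
}

expr_sketch conjunction(std::vector<expr_sketch> children) {
    expr_sketch e;
    e.is_conjunction = true;
    e.children = std::move(children);
    return e;
}

// Appends the factors of `e` to `out`, flattening nested conjunctions.
void collect_factors_sketch(const expr_sketch& e, std::vector<std::string>& out) {
    if (!e.is_conjunction) {
        out.push_back(e.relation);
        return;
    }
    for (const auto& child : e.children) {
        collect_factors_sketch(child, out);
    }
}

// Returns the terms whose AND is equivalent to `e`.
std::vector<std::string> boolean_factors_sketch(const expr_sketch& e) {
    std::vector<std::string> factors;
    collect_factors_sketch(e, factors);
    return factors;
}
```

For `WHERE a = ? AND b = ? AND c > ?` this yields the three relations in
order, and an empty conjunction yields no factors, which corresponds to a
statement without a WHERE clause.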
Closes #11105
* github.com:scylladb/scylla:
cql3: grammar: make where clause return an expression
cql3: util: deinline where clause utilities
cql3: util: change where clause utilities to accept a single expression rather than a vector of terms
cql3: statement_restrictions: accept a single expression rather than a vector
cql3: statement_restrictions: merge `if` and `for`
cql3: select_statement: remove wrong but harmless std::move() in prepare_restrictions
cql3: expr: add boolean_factors() function to factorize an expression
cql3: expression: define operator==() for expressions
cql3: values: add operator==() for raw_value
The new cases cover:
- a materialized view created with synchronous updates from the start
- a materialized view created with synchronous updates,
but then altered to not have synchronous updates anymore
The test verifies if a synchronous updates code path was triggered in a
view that had synchronous_updates property set to true.
Done by inspecting query traces.
Code that waited for all remote view updates was already there. This
commit modifies the conditions of this wait to take into account the
"synchronous mode" (enabled when db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY is
set).
This commit defines a new tag key (SYNCHRONOUS_VIEW_UPDATES_TAG_KEY) to
be used for marking "synchronous mode" views. This key is used in
`cf_prop_defs::apply_to_builder` if the properties contain
KW_SYNCHRONOUS_UPDATES.
Tags are a useful mechanism that could be used outside of alternator
namespace. My motivation to move tags_extension and other utilities to
db/tags/ was that I wanted to use them to mark "synchronous mode" views.
I have extracted `get_tags_of_table`, `find_tag` and `update_tags`
method to db/tags/utils.cc and moved alternator/tags_extension.hh to
db/tags/.
The signature of `get_tags_of_table` was changed from `const
std::map<sstring, sstring>&` to `const std::map<sstring, sstring>*`
Original behavior of this function was to throw an
`alternator::api_error` exception. This was undesirable, as it
introduced a dependency on the alternator module. I chose to change it
to return a potentially null value, and added a wrapper function to the
alternator module - `get_tags_of_table_or_throw` to keep the previous
throwing behavior.
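The split between a nullable getter and a throwing wrapper can be
sketched as follows. The `table_sketch` type, the `_sketch` function
names, and the use of std::runtime_error in place of
alternator::api_error are assumptions made so the example stays
self-contained.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

using tags_map = std::map<std::string, std::string>;

// Hypothetical stand-in for a table whose schema may or may not carry tags.
struct table_sketch {
    std::unique_ptr<tags_map> tags; // null when the table has no tags extension
};

// Generic layer: reports absence with a null pointer, so it does not need
// to know about any module-specific exception type.
const tags_map* get_tags_of_table_sketch(const table_sketch& t) noexcept {
    return t.tags.get();
}

// Module-specific wrapper: restores the throwing behavior for callers
// that relied on it.
const tags_map& get_tags_of_table_or_throw_sketch(const table_sketch& t) {
    const tags_map* tags = get_tags_of_table_sketch(t);
    if (!tags) {
        throw std::runtime_error("table has no tags");
    }
    return *tags;
}
```

Keeping the exception in the wrapper means the shared code has no
dependency on the module that defines the error type.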
This property can be used with CREATE MATERIALIZED VIEW and ALTER
MATERIALIZED VIEW statements. Setting it allows global views to enter
"synchronous mode". In this mode, all view updates are also applied
synchronously as if the view was local. This may reduce their
availability, but has the benefit of propagating a potential
inconsistency risk (in form of a write error) to the user, who can
respond to it appropriately (e.g. retry the write or fix the view
later).
"scylla task_histogram" and "scylla fiber" will now show coroutine "promises".
Refs #10894
Closes #11071
* github.com:scylladb/scylla:
test: gdb: test that "task_histogram -a" finds some coroutines
scylla-gdb.py: recognize coroutine-related symbols as task types
scylla-gdb.py: whitelist the .text section for task "vtables"
scylla-gdb.py: fix an error message
The cql-pytest cassandra_tests/validation/operations/select_test.py::
testSelectWithAlias uses a TTL but not because it wants to test the TTL
feature - it just wants to check the SELECT aliasing feature. The test
writes a TTL of 100 and then reads it back using an alias. We would
normally expect to read back 100 or 99, but to guard against a very slow
test machine, the test verified that we read back something between 70
and 100. I thought that allowing a ridiculous 30 second delay between
the write and the read requests was more than enough.
But in one run of the aarch64 debug build, this ridiculous 30 seconds
wasn't ridiculous enough - the delay ended up 35 seconds, and the
test failed!
So in this patch, I just make it even more ridiculous - we write 1000
and expect to read something over 100 - allowing a 900 second delay
in the test.
Note that neither the earlier 30-second nor the current 900-second delay
slows down the test in any way - this test will normally complete in
milliseconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11085
In preparation of the relaxation of the grammar to return any expression,
change the whereClause production to return an expression rather than
terms. Note that the expression is still constrained to be a conjunction
of relations, and our filtering code isn't prepared for more.
Before the patch, if the WHERE clause was optional, the grammar would
pass an empty vector of expressions (which is exactly correct). After
the patch, it would pass a default-constructed expression. Now that
happens to be an empty conjunction, which is exactly what's needed, but
it is too accidental, so the patch changes optional WHERE clauses to
explicitly generate an empty conjunction if the WHERE clause wasn't
specified.
Move closer to the goal of accepting a generic expression for WHERE
clause by accepting a generic expression in statement_restrictions. The
various callers will synthesize it from a vector of terms.
std::move(_where_clause) is wrong, because _where_clause is used later
(when analyzing GROUP BY), but also harmless (because the
statement_restrictions constructor accepts it by const reference).
To avoid confusion in the next patch where we'll pass _where_clause
to a different function, remove the bad std::move() in advance here.
When analyzing a WHERE clause, we want to separate individual
factors (usually relations), and later partition them into
partition key, clustering key, and regular column relations. The
first step is separation, for which this helper is added.
Currently, it is not required since the grammar supplies the
expression in separated form, but this will not work once it is
relaxed to allow any expression in the WHERE clause.
A unit test is added.
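A rough illustration of the separation step, written in Python rather than the C++ of the actual helper. The `Relation` and `Conjunction` classes and the `boolean_factors` name are hypothetical stand-ins for this sketch, not the cql3::expr API:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Relation:
    # Hypothetical stand-in for a single relation such as "a = 1".
    text: str


@dataclass
class Conjunction:
    # Hypothetical stand-in for an AND of sub-expressions.
    children: List["Expr"] = field(default_factory=list)


Expr = Union[Relation, Conjunction]


def boolean_factors(e: Expr) -> List[Expr]:
    """Flatten arbitrarily nested conjunctions into a flat list of factors."""
    if isinstance(e, Conjunction):
        out: List[Expr] = []
        for child in e.children:
            out.extend(boolean_factors(child))
        return out
    # Anything that is not a conjunction is a single factor.
    return [e]
```

In this sketch an empty conjunction yields no factors, which corresponds to a statement without a WHERE clause.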
This is useful for implementing operator==() for expressions, which in
turn require comparing constants, which contain raw_values.
Note that this is not CQL comparison (that would be implemented
in cql3::expr::evaluate() and would return a CQL boolean, not a C++
boolean), but a traditional C++ value comparison.
Fix https://github.com/scylladb/scylla-docs/issues/4041
I've added the upgrade guides from 2022.x.y to 2022.x.z. They are based on the previous upgrade guides for patch releases.
Closes #11104
* github.com:scylladb/scylla:
doc: add the new upgrade guide to the toctree
doc: add the upgrage guides from 2022.x.y to 2022.x.z
The criterion is too permissive because coroutine symbols (those
without the "[clone .resume]" part at the end, anyway) look like
normal function names; hopefully this won't give too many false
positives to become a problem.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Actual vtables do not reside there, but coroutine object vptrs point
at the actual coroutine code, which does.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Expiring entries are added when a message is received from an unknown
host. If the host is later added to the raft configuration they become
non expiring. After that they can only be removed when the host is
dropped from the configuration, but they should never become expiring
again.
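The intended lifecycle could be sketched as follows. This is a simplified Python model, not the actual C++ address map; the class name, method names and the TTL handling are assumptions made for illustration:

```python
from typing import Dict, Optional, Tuple


class AddressMap:
    """Toy model: entries are either expiring or non-expiring."""

    def __init__(self, ttl: float):
        self._ttl = ttl
        # server id -> (address, expiry time, or None for non-expiring)
        self._entries: Dict[str, Tuple[str, Optional[float]]] = {}

    def on_message_from(self, server_id: str, addr: str, now: float) -> None:
        entry = self._entries.get(server_id)
        if entry is not None and entry[1] is None:
            # Already non-expiring: must never become expiring again.
            return
        self._entries[server_id] = (addr, now + self._ttl)

    def on_added_to_config(self, server_id: str, addr: str) -> None:
        self._entries[server_id] = (addr, None)

    def on_removed_from_config(self, server_id: str) -> None:
        self._entries.pop(server_id, None)

    def find(self, server_id: str, now: float) -> Optional[str]:
        entry = self._entries.get(server_id)
        if entry is None:
            return None
        addr, expiry = entry
        if expiry is not None and now >= expiry:
            del self._entries[server_id]
            return None
        return addr
```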
Refs #10826
This patch avoids unnecessary CACHE_HITRATES updates through gossip.
After this patch:
Publish CACHE_HITRATES in case:
- We haven't published it at all
- The diff is bigger than 1% and we haven't published in the last 5 seconds
- The diff is really big (10%)
Note: A peer node can learn the cache hitrate through the read_data,
read_mutation_data and read_digest RPC verbs, which have cache_temperature in
the response. So there is no need to update CACHE_HITRATES through gossip at
high frequency.
We do the recalculation faster if the diff is bigger than 0.01. It is useful to
do the calculation even if we do not publish the CACHE_HITRATES through gossip,
since the recalculation will call the table->set_global_cache_hit_rate to set
the hitrate.
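The publishing rule above can be summarized in a small sketch. It is Python for brevity, not the patch's C++; the function and constant names are hypothetical, hit rates are assumed to be fractions in [0, 1], and the exact comparison operators at the boundaries are a guess:

```python
from typing import Optional

SMALL_DIFF = 0.01    # "bigger than 1%"
BIG_DIFF = 0.10      # "really big", 10%
MIN_INTERVAL = 5.0   # seconds between publications for small diffs


def should_publish(current: float,
                   last_published: Optional[float],
                   seconds_since_publish: float) -> bool:
    """Decide whether to gossip a new CACHE_HITRATES value."""
    if last_published is None:
        # Never published before.
        return True
    diff = abs(current - last_published)
    if diff > BIG_DIFF:
        # Large change: publish regardless of how recently we published.
        return True
    # Moderate change: publish only if the last publication is old enough.
    return diff > SMALL_DIFF and seconds_since_publish >= MIN_INTERVAL
```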
Fixes #5971
Closes #11079
In issue #10966, a user noticed that Alternator writes may be reordered
(a later write to an item is ignored with the earlier write to the same
item "winning") if Scylla nodes do not have synchronized time and if
always_use_lwt write isolation mode is not used.
In this patch I add to docs/alternator/compatibility.md a section about
this issue, what causes it, and how to solve or at least mitigate it.
Fixes#10966
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11094
Move some rare logs from TRACE to INFO level.
Add some assertions.
Write some more comments, including FIXMEs and TODOs.
Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
Group 0 discovery would internally fetch the seed list from gossiper.
Gossiper would return the seed list from conf/scylla.yaml. This seed
list is proper for the bootstrapping scenario - we specify the initial
contact points for a node that joins a cluster.
We'll have to use a different list of seeds for group 0 discovery for
the upgrade scenario. Prepare for that by taking the seed list as a
parameter.
In the bootstrap scenario we'll pass the seed list down from
`storage_service::join_cluster`.
Additionally, `join_group0` now takes an `as_voter` flag, which is
`false` in the bootstrap scenario (we initially join as a non-voter) but
will be `true` in the upgrade scenario.
See previous commit. `remove_from_group0` had a similar problem as
`leave_group0`: it would handle the case where `raft_group0::_group0`
variant was not `raft::group_id` (i.e. we haven't joined group 0), but
RAFT local feature was enabled - i.e. the yet-unimplemented upgrade case
- by running discovery and calling `send_group0_modify_config`.
Instead, if we see that we've joined group 0 before, assume that we're
still a member and simply use the Raft `modify_config` API to remove
another server. If we're not a member it means we either decommissioned
or were removed by someone else; then we have no business trying to
remove others. There's also the unimplemented upgrade case but that will
come in another pull request.
Finally, add some logic for handling an edge case: suppose we joined
group 0 recently and we still didn't fully update our RPC address map
(it's being updated asynchronously by Raft's io_fiber). Thus we may fail
to find a member of group 0 in the address map. To handle this, ensure
we're up-to-date by performing a Raft read barrier.
State some assumptions in a comment.
Add a TODO for handling failures.
Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
One of the following cases is true:
1. RAFT local feature is disabled. Then we don't do anything related to
group 0.
2. RAFT local feature is enabled and when we bootstrapped, we joined
group 0. Then `raft_group0::_group0` variant holds the
`raft::group_id` alternative.
3. RAFT local feature is enabled and when we bootstrapped we didn't join
group 0. This means the RAFT local feature was disabled when we
bootstrapped and we're in the (unimplemented yet) upgrade scenario.
`raft_group0::_group0` variant holds the `std::monostate` alternative.
The problem with the previous implementation was that it checked for the
conditions of the third case above - that RAFT local feature is enabled
but `_group0` does not hold `raft::group_id` - and if those conditions
were true, it executed some logic that didn't really make sense: it ran
the discovery algorithm and called `send_group0_modify_config` RPC.
In this rewrite I state some assumptions that `leave_group0` makes:
- we've finished the startup procedure.
- we're being run during decommission - after the node entered LEFT
status.
In the new implementation, if `_group0` does not hold `raft::group_id`
(checked by the internal `joined_group0()` helper), we simply return.
This is the yet-unimplemented upgrade case left for a follow-up PR.
Otherwise we fetch our Raft server ID (at this point it must be present
- otherwise it's a fatal error) and simply call `modify_config` from the
`raft::server` API.
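The resulting control flow might look roughly like this. It is a Python sketch only: the parameters and the `modify_config` callback signature (servers to add, servers to remove) are simplified assumptions, and the string return values exist purely to make the branches visible:

```python
from typing import Callable, List, Optional


def leave_group0(raft_enabled: bool,
                 group0_id: Optional[str],
                 my_server_id: Optional[str],
                 modify_config: Callable[[List[str], List[str]], None]) -> str:
    if not raft_enabled:
        # Case 1: nothing related to group 0 is done.
        return "raft-disabled"
    if group0_id is None:
        # Case 3: we never joined group 0 (the upgrade case, left for later).
        return "not-joined"
    if my_server_id is None:
        # Case 2 requires our server ID to be present; otherwise it is fatal.
        raise RuntimeError("joined group 0 but no Raft server ID found")
    # Case 2: remove ourselves through the regular configuration change API.
    modify_config([], [my_server_id])
    return "left"
```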
Remove unnecessary call to `_shutdown_gate.hold()` (this is not a
background task).
`leave_group0` was responsible for both removing a different node from
group 0 and removing ourselves (leaving) group 0. The two scenarios are
a bit different and the handling will be rewritten in following commits.
Split `leave_group0` into two functions. Remove the incorrect comment
about idempotency - saying that the procedure is idempotent is an
oversimplification, one could argue it's incorrect since the second call
simply hangs, at least in the case of leaving group 0; following commits
will state what's happening more precisely.
Add some additional logging and assertions where the two functions are
called in `storage_service`.
Contains all logic for deciding to join (or not join) group 0.
Prepare for the case where we don't want to join group 0 immediately on
startup - the upgrade scenario (will be implemented in a follow-up).
Move the group 0 setup step earlier in `storage_service::join_cluster`.
`join_group0()` is now a private member of `raft_group0`. Some more
comments were written.
Compared to `load_or_create_my_addr` this function assumes that
the address is already present on disk; if not, it's a fatal error.
Use it in places where it would indeed be a fatal error
if the address was missing.
There are some calls to `modify_config` which should react to aborts
(e.g. when we shutdown Scylla).
There are also calls to `send_group0_modify_config` which should
probably also react to aborts, but the functions don't take
an abort_source parameter. This is fixable but I left TODOs for now.
The function no longer accesses the `_group0` variant directly, instead
it is made a member of `service::persistent_discovery`; the caller
guarantees that `persistent_discovery` is not destroyed before the
function finishes.
The function is now named `run`. A short comment was written at the
declaration site.
Make some members of `persistent_discovery` private, as they are only
used by `run`.
Simplify `struct tracker`, store the discovery output separately
(`struct tracker` is now responsible for a single thing).
Enclose the `parallel_for_each` over requests in a common coroutine
which keeps alive all the necessary things for the loop body and
performs the last step which was previously inside a `then`.
The set of seeds passed to the discovery algorithm may contain `self`.
The implementation will filter the `self` out (it calls `step(seeds)`;
`step` iterates over the given list of peers and ignores `_self`).
Specify this at the `discovery` constructor declaration site.
Simplify the code constructing `persistent_discovery` in
`raft_group0::discover_group0` using this assumption.
Add a test for a wasm aggregate function
which uses the new metrics to check if the cache has
been hit at least once.
Also check that the cache can get reused on different
queries, by testing that the number of queries is
higher than the number of cache misses.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
When executing a wasm UDF, most of the time is spent on
setting up the instance. To minimize its cost, we reuse
the instance using wasm::instance_cache.
This patch adds a wasm instance cache, that stores
a wasmtime instance for each UDF and scheduling group.
The instances are evicted using LRU strategy. The
cache may store some entries for the UDF after evicting
the instance, but they are evicted when the corresponding
UDF is dropped, which greatly limits their number.
The size of stored instances is estimated using the size
of their WASM memories. In order to be able to read the
size of memory, we require that the memory is exported
by the client.
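As a rough model of the idea, here is a size-bounded LRU cache keyed by UDF and scheduling group. It is written in Python with hypothetical names and is much simpler than wasm::instance_cache; in particular, this sketch drops an entry entirely on eviction instead of keeping a placeholder for the UDF:

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple


class InstanceCache:
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        # (udf, scheduling group) -> (instance, estimated memory size)
        self._entries: "OrderedDict[Tuple[str, str], Tuple[Any, int]]" = OrderedDict()

    def get(self, udf: str, group: str,
            create: Callable[[], Tuple[Any, int]]) -> Any:
        key = (udf, group)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key][0]
        self.misses += 1
        instance, size = create()  # the expensive instance setup
        self._entries[key] = (instance, size)
        self.used_bytes += size
        # Evict least recently used instances until we fit again,
        # but always keep the instance we just created.
        while self.used_bytes > self.max_bytes and len(self._entries) > 1:
            _, (_, old_size) = self._entries.popitem(last=False)
            self.used_bytes -= old_size
        return instance

    def drop_udf(self, udf: str) -> None:
        # Remove everything cached for a dropped UDF.
        for key in [k for k in self._entries if k[0] == udf]:
            self.used_bytes -= self._entries.pop(key)[1]
```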
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, we have 2 merge_functions methods, where one only
calls the other. We can replace them with a single one.
The merge_functions method compiles a UDF (using create_func) only to
read its signature. We can avoid that by reading it from the row ourselves.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The Scylla Alternator documentation is now part of the Scylla user documentation (previously it was dev documentation).
This PR updates the links to the Alternator documentation.
Closes #11089
* github.com:scylladb/scylla:
doc: update the link to Alternator for DynamoDB users
doc: fix the links to Alternator
This PR removes all code that used classes `restriction`, `restrictions` and their children.
There were two fields in `statement_restrictions` that needed to be dealt with: `_clustering_columns_restrictions` and `_nonprimary_key_restrictions`.
Each function was reimplemented to operate on the new expression representation and eventually these fields weren't needed anymore.
After that the restriction classes weren't used anymore and could be deleted as well.
Now all of the code responsible for analyzing WHERE clause and planning a query works on expressions.
Closes #11069
* github.com:scylladb/scylla:
cql3: Remove all remaining restrictions code
cql3: Move a function from restrictions class to the test
cql3: Remove initial_key_restrictions
cql3: expr: Remove convert_to_restriction
cql3: Remove _new from _new_nonprimary_key_restrictions
cql3: Remove _nonprimary_key_restrictions field
cql3: Reimplement uses of _nonprimary_key_restrictions using expression
cql3: Keep a map of single column nonprimary key restrictions
cql3: Remove _new from _new_clustering_columns_restrictions
cql3: Remove _clustering_columns_restrictions from statement_restrictions
cql3: Use a variable instead of dynamic cast
cql3: Use the new map of single column clustering restrictions
cql3: Keep a map of single column clustering key restrictions
cql3: Return an expression in get_clustering_columns_restrctions()
cql3: Reimplement _clustering_columns_restrictions->has_supporting_index()
cql3: Don't create single element conjunction
cql3: Add expr::index_supports_some_column
cql3: Reimplement has_unrestricted_components()
cql3: Reimplement _clustering_columns_restrictions->need_filtering()
cql3: Reimplement num_prefix_columns_that_need_not_be_filtered
cql3: Use the new clustering restrictions field instead of ->expression
cql3: Reimplement _clustering_columns_restrictions->size() using expressions
cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions
cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions
cql3: expr: Add has_only_eq_binops function
cql3: Reimplement _clustering_columns_restrictions->empty() using expressions
This PR is V2 of https://github.com/scylladb/scylla/pull/11065.
The scope of updates:
- Created a _/cql/_ folder.
- Moved all the CQL-related pages from _/getting-started/_ to _/cql/_ .
- Moved the _cql-extensions.md_ file from _/dev/_ to _/cql/_ .
- Removed the outdated files and references.
- Updated the links to the CQL-related pages.
Closes #11083
* github.com:scylladb/scylla:
doc: update the links following the content reorganization
doc: remove the outdated cql pages and delete them from the indexes
doc: add index.rst for the cql folder and add it to toctree
doc: move cql-extensions.md from the dev docs to the cql folder
doc: move the CQL pages from getting-started to cql
doc: add redirections for the CQL pages
* seastar 6d4a0cb7a3...1d4432ed28 (11):
> rpc: Ignore failed future in connection::send()
> install-dependencies: centos-{7,8}: use {DTS,GTS}-11 instead of {DTS,GTS}-9
> coroutine: change access specifier of seastar::task member
> Merge 'build: try to enable io_uring if it is not specified' from Kefu Chai
> build: find_package() only if necessary
> Merge "rpc: handle connection negotiation error during stream sink creation " from Gleb
> build: try to enable io_uring if it is not specified
> test: rpc: add test that inject error during stream connection negotiation.
> test: rpc: inject errors only on streaming connections
> test: rpc: allow specifying after what limit a connection start producing errors
> rpc: do not destroy stream connection without stopping in case of negotiation failure
Fixes #10943
Closes #11082
query_result was the wrong place to put last position into. It is only
included in data-responses, but not on digest-responses. If we want to
support empty pages from replicas, both data and digest responses have
to include the last position. So hoist up the last position to the
parent structure: query::result. This is a breaking change inter-node
ABI wise, but it is fine: the current code wasn't released yet.
Closes #11072
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. Replace the constructor with helper
functions which specify in comments that they are supposed to be used in
tests or in contexts where `info` doesn't matter (e.g. when checking
presence in an `unordered_set`, where the equality operator and hash
operate only on the `id`).
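The shape of the split can be illustrated with Python dataclasses. The class and field names mirror the description above, but the types are simplified assumptions and this is not the Raft C++ code:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServerAddress:
    id: int
    # Excluded from equality and hashing, so set lookups work on `id` alone.
    info: bytes = field(default=b"", compare=False)


@dataclass(frozen=True)
class ConfigMember:
    addr: ServerAddress
    can_vote: bool = True
```

With this layout, a `ServerAddress` built from only an `id` still finds the full entry in a set, while voting status lives only where a configuration is being described.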
Closes #11047
* github.com:scylladb/scylla:
raft: fsm: fix `entry_size` calculation for config entries
raft: split `can_vote` field from `server_address` to separate struct
serializer_impl: generalize (de)serialization of `unordered_set`
to_string: generalize `operator<<` for `unordered_set`
The node operations using node_ops_cmd have the following procedure:
1) Send node_ops_cmd::replace_prepare to all nodes
2) Send node_ops_cmd::replace_heartbeat to all nodes
In a large cluster 1) might take a long time to finish, as a result when
the node starts to perform 2), the heartbeat timer on the peer nodes which
is 30s might have already timed out. This fails the whole node
operation.
We have patches to make 1) more efficient and faster.
https://github.com/scylladb/scylla/pull/10850
https://github.com/scylladb/scylla/pull/10822
In addition to that, this patch increases the heartbeat timeout to reduce
the false positive of timeout.
Refs #10337
Refs #11078
Closes #11081
The classes restriction, restrictions and its children
aren't used anywhere now and can be safely removed.
Some includes need to be modified for the code to compile.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
statement_restrictions_test uses a function that is defined
in multi_column_restriction.hh.
This file will be removed soon and for the test to still work
the function is moved to the test source.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
initial_key restrictions was a class used by statement_restrictions
to represent empty restrictions of different types and simplify
restriction merging logic. They are not used anymore and can
be removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The _new prefix was used to distinguish the new field
from the old representation.
Now the new field has fully replaced the old one
and _new can be removed from its name.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All code that made use of _nonprimary_key_restrictions
has been modified to use _new_nonprimary_key_restrictions
instead.
The field can be removed.
Additionally the old code responsible for adding new restrictions
can be fully removed, everything is now done using add_restriction.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All parts of the code that use _nonprimary_key_restrictions
are changed to use _new_nonprimary_key_restrictions instead.
I decided not to split this into multiple commits,
as there isn't a lot of changes and they are
analogous to the ones done before for partition
and clustering columns.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Keep a map of extracted restrictions for each restricted nonprimary column.
This map will be useful, just like the ones for clustering and partition
columns.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The _new was used to distinguish from the old field
during transition. Now the old field has been deleted
and the new one can take its place.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```
Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11066
* github.com:scylladb/scylla:
database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard
table: try_flush_memtable_to_sstable: consume: close reader on error
All code using the _clustering_columns_restrictions field
has been modified to instead use _new_clustering_columns_restrictions
expression representation.
The old field can now be removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There is a dynamic cast used to determine whether
clustering columns are restricted by a multi column
restriction.
Instead of doing that we can just use the _has_multi_column
variable.
It's also used a few lines higher, which means that
it should be already initialized.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
This PR extends #9209. It consists of 2 main points:
To enable parallelization of user-defined aggregates, a reduction function was added to the UDA definition. The reduction function is optional; it has to be a scalar function that takes 2 arguments of the UDA's state type and returns the UDA's state.
All currently implemented native aggregates got reducible counterparts, which return their state as the final result so that it can be reduced with other results. Hence all native aggregates can now be distributed.
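To make the role of the reduction function concrete, here is a toy average aggregate in Python. The function names are hypothetical and this is not how forward_service is implemented; it only shows why a function combining two states is enough to distribute the computation:

```python
from functools import reduce


def avg_state(state, value):
    # State function: fold one value into the (sum, count) state.
    s, n = state
    return (s + value, n + 1)


def avg_reduce(a, b):
    # Reduction function: takes two states and returns a combined state.
    return (a[0] + b[0], a[1] + b[1])


def avg_final(state):
    # Final function: turn the state into the user-visible result.
    s, n = state
    return s / n if n else None


def aggregate_shard(values, init=(0, 0)):
    # What each node or shard computes on its own slice of the data.
    return reduce(avg_state, values, init)


def aggregate_distributed(shards):
    # Partial states are merged with the reduction function, then finalized.
    partial = [aggregate_shard(values) for values in shards]
    return avg_final(reduce(avg_reduce, partial, (0, 0)))
```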
Local 3-node cluster made with current master. `node1` updated to this branch. Accessing node with `ccm <node-name> cqlsh`
I've tested the following from both the old and the new node:
- creating UDA with reduce function - not allowed
- selecting count(*) - distributed
- selecting other aggregate function - not distributed
Fixes: #10224
Closes #10295
* github.com:scylladb/scylla:
test: add tests for parallelized aggregates
test: cql3: Add UDA REDUCEFUNC test
forward_service: enable multiple selection
forward_service: support UDA and native aggregate parallelization
cql3:functions: Add cql3::functions::functions::mock_get()
cql3: selection: detect parallelize reduction type
db,cql3: Move part of cql3's function into db
selection: detect if selectors factory contains only simple selectors
cql3: reducible aggregates
DB: Add `scylla_aggregates` system table
db,gms: Add SCYLLA_AGGREGATES schema features
CQL3: Add reduce function to UDA
gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
"
Same thing was done for compaction class some time ago, now
it's time for streaming to keep repair-generated IO in bounds.
This set mostly resembles the one for compaction IO class with
the exception that boot-time reshard/reshape currently runs in
streaming class, but that's not great if the class is throttled,
so the set also moves boot-time IO into default IO class.
"
* 'br-streaming-class-throttling-2' of https://github.com/xemul/scylla:
distributed_loader: Populate keyspaces in default class
streaming: Maintain class bandwidth
streaming: Pass db::config& to manager constructor
config: Add stream_io_throughput_mb_per_sec option
sstables: Keep priority class on sstable_directory
Having this map is useful in a bunch of places.
To keep code simple it could be created from scratch each time,
but it's also used in do_filter, so this could actually
affect performance.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
get_clustering_columns_restrctions() used to return
a shared pointer to the clustering_restrictions class.
Now everything is being converted to expression,
so it should return an expression as well.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```
Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.
Fixes #11076
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In case the expression is empty and we want to merge it
with a new restriction we can just set the expression
to the new restriction.
Later this will make it easier to distinguish which case
of multi column restrictions we are dealing with.
IN and EQ can only have a single binary operator,
but slice might have two.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that checks if there is an index
which supports one of the columns present in
the given expression.
This functionality will soon be needed for
clustering and nonprimary columns so it's
good to separate into a reusable function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
If an exception is thrown in `consume` before
write_memtable_to_sstable is called or if the latter fails,
we must close the reader passed to it.
Fixes #11075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch makes memtable_flush_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch makes compaction_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch adds the _static_shares variable to the backlog_controller so that
instead of having to use a separate constructor when controller is disabled,
we can use a single constructor and periodically check on the adjust method
if we should use the static shares or the controller. This will be useful on
the next patches to make compaction_static_shares and memtable_flush_static_shares
live updateable.
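A minimal model of the single-constructor approach, in Python. The names are hypothetical, and treating a value of 0 as "static shares disabled" is an assumption of this sketch rather than something stated above:

```python
from typing import Callable


class BacklogController:
    def __init__(self,
                 static_shares: Callable[[], float],
                 backlog_to_shares: Callable[[float], float]):
        # A callable stands in for a live-updateable config value.
        self._static_shares = static_shares
        self._backlog_to_shares = backlog_to_shares
        self.shares = 0.0

    def adjust(self, backlog: float) -> None:
        # Called periodically: re-read the option on every adjustment.
        static = self._static_shares()
        if static > 0:
            self.shares = static
        else:
            self.shares = self._backlog_to_shares(backlog)
```

Because the option is re-read on each call to `adjust`, changing it at runtime takes effect at the next adjustment without constructing a new controller.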
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
Currently, we use the following naming convention for relocatable package
filename:
${package_name}-${arch}-package-${version}.${release}.tar.gz
But this is very different from standard Linux packaging systems such as
.rpm and .deb.
Let's align the convention to .rpm style, so the new convention should be:
${package_name}-${version}-${release}.${arch}.tar.gz
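For comparison, the two patterns expressed as small Python helpers. The helper names and the sample values used below are made up for illustration:

```python
def old_reloc_name(package_name: str, arch: str, version: str, release: str) -> str:
    # ${package_name}-${arch}-package-${version}.${release}.tar.gz
    return f"{package_name}-{arch}-package-{version}.{release}.tar.gz"


def new_reloc_name(package_name: str, arch: str, version: str, release: str) -> str:
    # ${package_name}-${version}-${release}.${arch}.tar.gz
    return f"{package_name}-{version}-{release}.{arch}.tar.gz"
```

For example, with the hypothetical values `scylla`, `x86_64`, `5.1.0` and `0.20220718.abc`, the old scheme gives `scylla-x86_64-package-5.1.0.0.20220718.abc.tar.gz` and the new one gives `scylla-5.1.0-0.20220718.abc.x86_64.tar.gz`.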
Closes #9799
Closes #10891
* tools/java de8289690e...d0143b447c (1):
> build_reloc.sh: rename relocatable packages
* tools/jmx fe351e8...06f2735 (1):
> build_reloc.sh: rename relocatable packages
* tools/python3 e48dcc2...bf6e892 (1):
> reloc/build_reloc.sh: rename relocatable packages
The streaming class throughput can be limited with the respective option.
Doing boot-time reshard/reshape doesn't need to obey it, as the node is
not yet up but instead should get there as soon as possible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The stream_manager will bookkeep the streaming bandwidth option, to
subscribe on its changes it needs the config reference. It would be
better if it was stream_manager::config, but currently subscription on
db::config::<stuff> updates is not very shard-friendly, so we need to
carry the config reference itself around.
Similar trouble is there for compaction_manager. The option is passed
through its own config, but the config is created on each shard by
database code. Stream manager config would be created once by main code
on shard 0.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's going to control the bandwidth for the streaming prio class.
For now it's just added but doesn't work for real.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current code accepts priority class as an argument to various functions
that need it and all its callers use streaming class. Next patches will
need to sometimes use default class, but it will require heavy patching
of the distributed loader. Things get simpler if the priority class is
kept on sstable_directory on start.
This change also simplifies the ongoing effort on unification of sched
and IO classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When collecting a histogram of smp-queues population empty queues also
count, but it makes the output very long and not very informative.
Skipping empty queues increases signal / noise ratio.
v2:
- print the number of omitted empty queues
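
A rough Python sketch of the idea: drop empty queues from the report but say how many were dropped. The function name, the queue labels, and the output format are assumptions; the real command is implemented in scylla-gdb.py and its output differs:

```python
def smp_queue_report(populations):
    """Report non-empty queues, largest first, plus a count of omitted ones."""
    non_empty = {q: n for q, n in populations.items() if n > 0}
    omitted = len(populations) - len(non_empty)
    lines = [f"{q}: {n}"
             for q, n in sorted(non_empty.items(), key=lambda kv: -kv[1])]
    lines.append(f"({omitted} empty queues omitted)")
    return lines


# Hypothetical "sender->receiver" labels and populations.
report = smp_queue_report({"0->1": 0, "0->2": 7, "1->0": 2, "1->2": 0})
```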
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220718180912.2931-1-xemul@scylladb.com>
We want to print the backwards fiber in reverse, starting with the
furthest-away task in the chain. For this, the task list returned by
`_walk()` has to be reversed.
Closes #11062
We forgot about `can_vote`.
Stumbled on this while separating `can_vote` into a separate struct.
Note that `entry_size` is still inaccurate (#11068) but the patch is an
improvement.
Refs: #11068
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
transparent hash and comparator functions, which allow invoking
`contains` and similar functions by providing a different key type (in
this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
helper functions in the test helpers module and use them.
The code is copied from:
single_column_primary_key_restrictions<clustering_key>
::num_prefix_columns_that_need_not_be_filtered
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Instead of writing
_clustering_columns_restrictions->expression
It's better to use the new field:
_new_clustering_columns_restrictions
These expressions should be the same.
It removes another use of the unwanted restrictions field.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function which checks that an expression
contains only binary operators with '='.
Right now this check is done only in a single place,
but soon the same check will have to be done
for clustering columns as well, so the code
is moved to a separate function to prevent duplication.
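
As an illustration only, such a check could look like the following Python sketch over a toy expression tree. The `BinaryOperator` and `Conjunction` classes and the function name are hypothetical and much simpler than the real cql3 expression types:

```python
from dataclasses import dataclass


@dataclass
class BinaryOperator:
    lhs: str
    op: str
    rhs: object


@dataclass
class Conjunction:
    children: list


def has_only_eq_binops(expr):
    """True if every binary operator reachable from expr uses '='."""
    if isinstance(expr, BinaryOperator):
        return expr.op == "="
    if isinstance(expr, Conjunction):
        return all(has_only_eq_binops(child) for child in expr.children)
    return True  # other node kinds contain no binary operators in this toy model


only_eq = Conjunction([BinaryOperator("p1", "=", 1), BinaryOperator("p2", "=", 2)])
mixed = Conjunction([BinaryOperator("p1", "=", 1), BinaryOperator("c1", ">", 5)])
```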
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All occurrences of _clustering_columns_restrictions->empty()
have been replaced with code that operates on the new
expression representation: _new_clustering_columns_restrictions.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Enables parallelization of queries like `SELECT MIN(x), MAX(x)`.
Compatibility is ensured under the same cluster feature as
UDA and native aggregates parallelization. (UDA_NATIVE_PARALLELIZED_AGGREGATION)
Enables parallelization of UDA and native aggregates. The way the
query is parallelized is the same as in #9209. Separate reduction
type for `COUNT(*)` is left for compatibility reasons.
`mock_get` was created only for forward_service use, thus it only checks for
aggregate functions if no declared function was found.
The reason for this function is that there is no serialization of `cql3::selection::selection`,
so the functions lying underneath these selections have to be looked up again.
Most of this code is copied from `functions::get()`. However, `functions::get()` is not used because it would require
mocking or serializing expressions, and `functions::find()` is not enough
because it does not search for dynamic aggregate functions.
Moving `function`, `function_name` and `aggregate_function` into
db namespace to avoid including cql3 namespace into query-request.
For now, only minimal subset of cql3 function was moved to db.
Because `selection` is not serializable and it has to be sent via the network
to parallelize query, we have to mock the selection. To simplify
the mocking, for now only single selectors for aggregate's arguments
are allowed (no casting or other functions as arguments).
This work gets us a step closer to compaction groups.
Everything in compaction layer but compaction_manager was converted to table_state.
After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any.
Closes #11028
* github.com:scylladb/scylla:
compaction: remove forward declaration of replica::table
compaction_manager: make add() and remove() switch to table_state
compaction_manager: make run_custom_job() switch to table_state
compaction_manager: major: switch to table_state
compaction_manager: scrub: switch to table_state
compaction_manager: upgrade: switch to table_state
compaction: table_state: add get_sstables_manager()
compaction_manager: cleanup: switch to table_state
compaction_manager: offstrategy: switch to table_state()
compaction_manager: rewrite_sstables(): switch to table_state
compaction_manager: make run_with_compaction_disabled() switch to table_state
compaction_manager: compaction_reenabler: switch to table_state
compaction_manager: make submit(T) switch to table_state
compaction_manager: task: switch to table_state
compaction: table_state: Add is_auto_compaction_disabled_by_user()
compaction: table_state: Add on_compaction_completion()
compaction: table_state: Add make_sstable()
compaction_manager: make can_proceed switch to table_state
compaction_manager: make stop compaction procedures switch to table_state
compaction_manager: make get_compactions() switch to table_state
compaction_manager: change task::update_history() to use table_state instead
compaction_manager: make can_register_compaction() switch to table_state
compaction_manager: make get_candidates() switch to table_state
compaction_manager: make propagate_replacement() switch to table_state
compaction: Move table::in_strategy_sstables() and switch to table_state
compaction: table_state: Add maintenance sstable set
compaction_manager: make has_table_ongoing_compaction() switch to table_state
compaction_manager: make compaction_disabled() switch to table_state
compaction_manager: switch to table_state for mapping of compaction_state
compaction_manager: move task ctor into source
Fix https://github.com/scylladb/scylla-docs/issues/4040
Fix https://github.com/scylladb/scylla-docs/issues/4128
This PR adds the upgrade guides from ScyllaDB Enterprise 2021.1 to 2022.1. They are based on the previous guides.
Closes #11036
* github.com:scylladb/scylla:
doc: add the description of the new metrics in 2022.1
doc: remove the upgrade guide for Ubuntu 16.04 (no longer supported in version 2022.1)
doc: remove the outdated warning
Update docs/upgrade/_common/upgrade-guide-from-2021.1-to-2022.1-ubuntu-and-debian.rst
Update docs/upgrade/_common/upgrade_to_2022_warning.rst
doc: add a space on line 60 to fix the warning
doc: document metric update for 2022.1
doc: add the upgrade guide from 2021.1 to 2022.1
This series adds removal of dropped table directory when it has no remaining snapshots.
There are 2 code paths that take care of that:
1. when the table is dropped and there are no active snapshots for it (typically when auto_snapshot disabled).
2. or when the last snapshot is cleared, leaving no other snapshot for a dropped table.
Unit tests were extended to cover these scenarios.
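
The two code paths can be sketched in Python as follows. This is a simplified model, assuming a `snapshots/<tag>` subdirectory layout under the table directory. The helper names and the table directory name are hypothetical:

```python
import shutil
import tempfile
from pathlib import Path


def remove_table_dir_if_no_snapshots(table_dir: Path) -> bool:
    """Path 1: called when the table is dropped."""
    snapshots = table_dir / "snapshots"
    if snapshots.is_dir() and any(snapshots.iterdir()):
        return False  # snapshots remain, so keep the directory
    shutil.rmtree(table_dir)
    return True


def clear_snapshot(table_dir: Path, tag: str, table_is_dropped: bool) -> None:
    """Path 2: clearing the last snapshot of a dropped table removes its directory."""
    shutil.rmtree(table_dir / "snapshots" / tag)
    if table_is_dropped:
        remove_table_dir_if_no_snapshots(table_dir)


root = Path(tempfile.mkdtemp())
table_dir = root / "ks" / "t-1234"  # hypothetical table directory name
(table_dir / "snapshots" / "snap1").mkdir(parents=True)
removed_on_drop = remove_table_dir_if_no_snapshots(table_dir)  # snap1 still exists
clear_snapshot(table_dir, "snap1", table_is_dropped=True)
```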
Fixes #10896
Closes #11001
* github.com:scylladb/scylla:
legacy_schema_migrator: simplify drop_legacy_tables
database: clear_snapshot: remove dropped table directory when it has no remaining snapshots
database: clear_snapshot: make it a coroutine and use thread
database_test: add clear_multiple_snapshots test
database: make drop_column_family private
schema_tables: merge_tables_and_views: use drop_table_on_all_shards
database_test: drop_table_with_snapshots: test auto_snapshot
database_test: populate_from_quarantine_works: pass optional db:config to do_with_some_data
database: drop_table_on_all_shards: remove table directory having no snapshots
sstables: define table_subdirectories
sstables: officially define pending_delete_dir
database: add drop_table_on_all_shards
There is no need for utils::make_joinpoint now
that the function calls replica::database::drop_table_on_all_shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and use an async thread around `directory_lister`
rather than `lister::scan_dir` to simplify the implementation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on the `clear_snapshot` test.
Test with multiple snapshots and different
combinations of parameters to database::clear_snapshot.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So that the dropped table's directory can be
removed after it has been dropped on all shards
if it has no snapshots.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor test_drop_table_with_auto_snapshot out of
drop_table_with_snapshots, adding an auto_snapshot param,
controlling how to configure the cql_test_env db::config::auto_snapshot,
so we can test both cases - auto_snapshot enabled and disabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of just `tmpdir_for_data`, so we can easily set auto_snapshot
for `drop_table_with_snapshots` in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the table to remove has no snapshots then
completely remove its directory on storage
as the left-over directory slows down operations on the keyspace
and makes searching for live tables harder.
Fixes #10896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using the "pending_delete" string
in `pending_delete_dir_basename()`, so it can
be orderly removed in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Runs drop_column_family on all database shards.
Will be extended later to consider removing the table directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
compaction_manager.cc still cannot stop including replica/database.hh
because upgrade and scrub still take replica::database as param,
but I'll remove it soon in another series.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
rewrite_sstables() is used by maintenance compactions that perform
an operation on a single file at a time.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that submit() switched to table_state, compaction_reenabler
and friends can switch to table_state too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
auto_compaction_disabled_by_user is a configuration that can be enabled
or disabled on a particular table. We're adding this interface to
avoid having to push the configuration for every compaction_state,
which would result in redundant information as the configuration
value is the same for all table states.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The idea is that we'll have a single on-completion interface for both
"in-strategy" and off-strategy compactions, so not to pollute table_state
with one interface for each.
replica::table::on_compaction_completion is being moved into private namespace.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_manager needs this interface when setting the sstable
creation lambda in compaction_descriptor, which is then forwarded
into the actual compaction procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
they're used to stop all ongoing compaction on behalf of a given table
T. Today, each table has a single table_state representing it, but after
we implement compaction groups, we'll need to call the procedure for
each group in a table. But the discussion doesn't belong here, as
compaction group work will only come later. For the time being, we're
only making compaction manager fully switch to table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The only external user of get_compactions() doesn't use any filtering,
so after table_state switch, one will be allowed to get all jobs
running associated with a table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
propagate_replacement is used by incremental compaction to notify
ongoing compaction about sstable list updates, such that the
ongoing job won't hold reference to exhausted sstables.
So it needs to switch to table_state, too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
in_strategy_sstables() doesn't have to be implemented in table, as it's
simply about main set with maintenance and staging files filtered out.
Also, let's make it switch to table_state as part of ongoing work.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
manager stores a state for each table. As we're transitioning towards
table_state, the mapping of a table to compaction state will now use
table_state ptr as key. table_state ptr is stable and its lifetime
is the same as table.
we're temporarily adding a ptr to compaction_state, as there's lots
of dependency on replica::table, but we'll get rid of it once
we complete the transition.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's to be able to get table_state from table in subsequent patch,
as table only has a forward declaration to it in compaction_manager.hh
to avoid including database.hh.
Once everything is moved to table_state, then ctor can be moved
back into header.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
storage_service/keyspaces?type=user along with user keyspaces returned
the keyspaces that were internal but non-system.
The list of the keyspaces for the user option
(storage_service/keyspaces?type=user) contains neither system nor
internal but only user keyspaces.
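
A small Python sketch of the intended filtering. The `Keyspace` record, its flags, and the keyspace names are hypothetical stand-ins for the real keyspace metadata:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyspace:
    name: str
    is_system: bool = False
    is_internal: bool = False


def user_keyspaces(keyspaces):
    """Keep keyspaces that are neither system nor internal."""
    return [ks.name for ks in keyspaces
            if not ks.is_system and not ks.is_internal]


all_keyspaces = [
    Keyspace("system", is_system=True, is_internal=True),
    Keyspace("internal_only", is_internal=True),  # internal but non-system
    Keyspace("my_app"),
]
result = user_keyspaces(all_keyspaces)
```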
Fixes: #11042
Closes #11049
Rename reclaim_timer::_reserve_segments to _segments_to_release
as it is clearer and more suitable for later patches
that will add reclaim_timers in more functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Remove some stale entries, add new entries for docs/.
Closes #11046
* github.com:scylladb/scylla:
CODEOWNERS: add owners for docs/
CODEOWNERS: remove @haaawk
User documentation was recently migrated to scylla.git, and this is
maintained by non scylla-core people. Add entries for docs/ so they are
notified when somebody submits changes to docs/.
These are cleanups needed for upcoming series that will make manager switch to table abstraction.
Closes #11037
* github.com:scylladb/scylla:
compaction_manager: remove unused variable in rewrite_sstable()
table: remove ref from on_compaction_completion() signature
table: use compaction_completion_desc to describe changes for off-strategy
compaction_manager: rename table_state's get_sstable_set to main_sstable_set
* seastar 7d8d846b26...6d4a0cb7a3 (18):
> io: Adjust IO latency goal on fair-queue level
Fixes #10927
> coroutine: exception: deprecate return_exception(exception_ptr)
> Merge "Make fair-queue class manipulations noexcept" from Pavel E
> lowres_timers: Put timeout to infinity if no timers armed
> util/conversion: support IEC prefix like "Ki"
> util/conversion: use string_view instead of string
> thread: fix backtrace termination for s390x on clang
> *: add fmt::ostream_formatter<> so {fmt} can use operator<<
> net: Remove operator<< for ipv4_addr
> rpc-impl: Log "caught exception" when catching exception
> rpc: Don't format non-trivial types with format specifier
> sharded: use std::invoke() to call mapper function
> Merge 'Avoid false-positive warnings in Gcc 12.1.1' from Nadav Har'El
> tls_test: Remove dns bottle neck + improve read loop in google connect test
> Revert "sstring: restore compatibility with std::string"
> sstring: restore compatibility with std::string
> tls_test: Make google https connect routine loop buffer reads
> coroutine: add buffer support to async generator
Closes #11033
Now update_sstable_lists_on_off_strategy_completion() and
on_compaction_completion() can be called from the same unified
interface.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To make it possible to add a single interface in table_state for
updating sstable list on behalf of both off-strategy and in-strategy
compactions, update_sstable_lists_on_off_strategy_completion() will
work with compaction_completion_desc too for describing sstable set
changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With compaction_manager switching to table_state, we'll need to
introduce a method in table_state to return maintenance set.
So better to have a descriptive name for main set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
- bytes_ostream has a default initial chunk size of 512.
- let's say we call bytes_ostream::write() to write 500 bytes.
- as next_alloc_size() takes into account space to hold chunk metadata
(24 bytes) + chunk data, then 512 bytes is not enough, so it returns
500 + 24 instead to be allocated.
- when allocating next chunk, next_alloc_size() will use the size of
existing chunk, which is 500 bytes (without metadata) and multiply it
by 2 (growth factor), so 1000 bytes is allocated for it.
So allocations can be non power-of-two, resulting in memory waste.
When seastar is allocating from small pools, the waste is not terrible
(although accumulated small wastes can be problematic), but once
allocations pass the large threshold (16k), then alignment is 4k
(page size) and the waste is not negligible.
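
The arithmetic above can be replayed with a simplified Python model. The constants come from the description. The function names and the page-rounding helper are assumptions and do not mirror the real bytes_ostream or seastar allocator code:

```python
CHUNK_METADATA = 24        # bytes of per-chunk metadata, per the description
INITIAL_CHUNK_SIZE = 512   # default initial chunk size
GROWTH_FACTOR = 2


def first_alloc_size(write_size):
    # The chunk must hold the metadata plus the data being written.
    return max(INITIAL_CHUNK_SIZE, write_size + CHUNK_METADATA)


def following_alloc_size(prev_data_size):
    # The next chunk is sized from the previous chunk's data size.
    return prev_data_size * GROWTH_FACTOR


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def page_rounding_waste(alloc_size, page_size=4096):
    # Bytes lost if the allocation is rounded up to a page multiple.
    return -alloc_size % page_size


first = first_alloc_size(500)        # 500 + 24
second = following_alloc_size(500)   # 500 * 2
waste = page_rounding_waste(32000)   # a later doubling step above 16k
```

Neither 524 nor 1000 is a power of two, and a 32000-byte request rounded up to 32768 leaves 768 bytes unused in this model.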
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11027
cql-pytest contains subdirectories with tests
ported from Cassandra. It's desirable to preserve
the same layout and file names for these tests as in
the original source tree. To do that, add support for recursive
search of tests to PythonTestSuite. The log files for
the tests which are found recursively are created in subdirs
of the test tmpdir.
While implementing the feature, switch to using pathlib,
since a) it supports rglob (recursive glob) and b) it
was requested in one of the earlier reviews.
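
A minimal sketch of recursive discovery with pathlib, assuming test files match `test_*.py`. The helper names and the example directory layout are hypothetical and are not the actual PythonTestSuite code:

```python
import tempfile
from pathlib import Path


def find_tests(suite_dir: Path, pattern: str = "test_*.py"):
    """Find tests recursively; return paths relative to the suite directory."""
    return sorted(p.relative_to(suite_dir) for p in suite_dir.rglob(pattern))


def log_path_for(test: Path, tmpdir: Path) -> Path:
    """Mirror the test's subdirectory under tmpdir for its log file."""
    log = tmpdir / test.with_suffix(".log")
    log.parent.mkdir(parents=True, exist_ok=True)
    return log


suite = Path(tempfile.mkdtemp())
tmpdir = Path(tempfile.mkdtemp())
(suite / "cassandra_tests" / "validation").mkdir(parents=True)
(suite / "test_a.py").touch()
(suite / "cassandra_tests" / "validation" / "test_b.py").touch()

tests = find_tests(suite)
logs = [log_path_for(t, tmpdir) for t in tests]
```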
Closes #11018
This patch adds several more tests for Alternator's UpdateItem operation.
These tests verify a few simple cases that, surprisingly, never had test
coverage. The new tests pass (on both DynamoDB and Alternator) so did not
expose any bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11025
If the compaction_descriptor returned by `time_window_compaction_strategy::get_sstables_for_compaction`
is marked with `has_only_fully_expired::yes` it should always be compacted
since `time_window_compaction_strategy::get_sstables_for_compaction` is not idempotent.
It sets `_last_expired_check` and if compaction is postponed and retried before
`expired_sstable_check_frequency` has passed, it will not look for those fully-expired
sstables again. Plus, compacting them is the cheapest possible as it does not require
reading anything, just deleting the input sstables, so there's no reason to postpone it.
Also, extend `max_ongoing_compaction_test` to test serialization of compaction jobs with the same weight.
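
The non-idempotency described above can be modelled with a toy Python class. This is a sketch under simplifying assumptions: the class and method names are hypothetical, and the method only returns the fully-expired descriptor, while the real strategy would fall back to other candidates:

```python
class ToyTimeWindowStrategy:
    def __init__(self, expired_sstable_check_frequency):
        self.expired_sstable_check_frequency = expired_sstable_check_frequency
        self._last_expired_check = None

    def get_fully_expired_descriptor(self, now, fully_expired_sstables):
        due = (self._last_expired_check is None or
               now - self._last_expired_check >= self.expired_sstable_check_frequency)
        if not due:
            return None  # the expired check is skipped entirely
        self._last_expired_check = now  # side effect: calls are not idempotent
        if not fully_expired_sstables:
            return None
        return {"sstables": list(fully_expired_sstables),
                "has_only_fully_expired": True}


strategy = ToyTimeWindowStrategy(expired_sstable_check_frequency=600)
first = strategy.get_fully_expired_descriptor(now=0, fully_expired_sstables=["sst-1"])
# If the caller postpones `first` and retries too early, the sstable is missed.
retry = strategy.get_fully_expired_descriptor(now=10, fully_expired_sstables=["sst-1"])
```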
Fixes #10989
Closes #10990
* github.com:scylladb/scylla:
compaction_manager: always register descriptor with fully expired sstables for compaction
test: max_ongoing_compaction_test: test serialization of regular compaction with same weight
test: max_ongoing_compaction_test: reindent refactored code
test: max_ongoing_compaction_test: define compact_all_tables lambda
test: max_ongoing_compaction_test: refactor make_table_with_single_fully_expired_sstable
test: max_ongoing_compaction_test: reduce number of tables
This series adds the infrastructure needed for testing user permissions, like the ability to create temporary roles and CQL sessions which log in as different users, and a few initial test cases for granting and revoking permissions.
Closes #10998
* github.com:scylladb/scylla:
cql-pytest: add a case for granting/revoking data permissions
cql-pytest: add new_user and new_session utils
cql-pytest: speed up permissions refresh period for tests
Said parameter is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
`emit_only_live_rows::no` for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumers --
basically an additional if or two. With everybody using the `::no` variant
of the compactor, we can remove this template parameter and the logic
associated with it altogether.
Closes #10931
* github.com:scylladb/scylla:
multishard_mutation_query: remove now pointless compact_for_result_state typedef
mutation_compactor: remove only-live related logic
mutation_compactor: remove emit_only_live_rows template parameter
mutation_compactor: remove unused compact_mutation_state::parameters
querier: remove {data,mutation}_querier aliases
querier: remove now pointless emit_only_live_rows template parameter
tree: use emit_only_live_rows::no
querier: querier_cache: de-override insert() methods
This is part of the support for installing executables from PIP packages.
We now support installing executables from a PIP package, but they are
installed under /opt/scylladb/python3/bin.
To call these commands without specifying the full path, we also need to install
symlinks in /usr/bin.
To do this, we need a new list which specifies the command names to symlink.
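
As a rough illustration of the symlink step, here is a Python sketch that works in a temporary directory standing in for the real paths. The function name and the `example-tool` command are hypothetical, and the actual install scripts may do this differently:

```python
import tempfile
from pathlib import Path


def install_symlinks(commands, bin_dir: Path, link_dir: Path):
    """Create link_dir/<name> -> bin_dir/<name> for every listed command."""
    created = []
    for name in commands:
        link = link_dir / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link from an earlier install
        link.symlink_to(bin_dir / name)
        created.append(link)
    return created


root = Path(tempfile.mkdtemp())
pip_bin = root / "opt" / "scylladb" / "python3" / "bin"  # stands in for the real prefix
usr_bin = root / "usr" / "bin"
pip_bin.mkdir(parents=True)
usr_bin.mkdir(parents=True)
(pip_bin / "example-tool").write_text("#!/bin/sh\necho hello\n")

links = install_symlinks(["example-tool"], pip_bin, usr_bin)
```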
Closes #10748
This series includes an assortment of loosely related improvements developed for a recent investigation. The changes include:
* Fix broken `std_deque` wrapper.
* Make `scylla smp-queues` fast.
* Teach `scylla smp-queues` to filter for sender CPU (`--from`), receiver CPU (`--to`), or both.
* Teach `scylla smp-queues` to make histogram over content of the queues -- i.e. the type of tasks in the smp queues.
* Teach `scylla smp-queues` to filter for tasks belonging to a certain scheduling group.
* Teach `scylla task_histogram` to include only tasks in the histogram.
* Teach `scylla task_histogram` to filter for tasks belonging to a certain scheduling group.
* Teach `scylla-fiber` to walk in both directions.
And some refactoring.
Fixes: https://github.com/scylladb/scylla/issues/7059
Closes #11019
* github.com:scylladb/scylla:
docs/dev/debugging.md: update continuation chain traversal guide
scylla-gdb.py: scylla fiber: walk continuation chain in both directions
scylla-gdb.py: scylla fiber: allow passing analyzed pointers to _probe_pointer()
scylla-gdb.py: scylla fiber: hoist preparatory code out of _walk()
scylla-gdb.py: scylla task_histogram: add --scheduling-groups option
scylla-gdb.py: scylla task_histogram: add --filter-tasks option
scylla-gdb.py: scylla task_histogram: use histogram class
scylla-gdb.py: scylla-fiber: extract symbol matching logic
scylla-gdb.py: histogram: add limit feature
scylla-gdb.py: histogram: handle formatting errors
scylla-gdb.py: intrusive_slist: avoid infinite recursion in __len__()
scylla-gdb.py: scylla smp-queues: add --scheduling-group option
scylla-gdb.py: scylla smp-queues: add --content switch
scylla-gdb.py: smp-queue: add filtering capability
scylla-gdb.py: make scylla smp-queues fast
scylla-gdb.py: fix disagreement between std_deque len() and iter()
If the compaction_descriptor returned by time_window_compaction_strategy::get_sstables_for_compaction
is marked with has_only_fully_expired::yes it should always be compacted
since time_window_compaction_strategy::get_sstables_for_compaction is not idempotent.
It sets _last_expired_check and if compaction is postponed and retried before
expired_sstable_check_frequency has passed, it will not look for those fully-expired
sstables again. Plus, compacting them is the cheapest possible as it does not require
reading anything, just deleting the input sstables, so there's no reason to postpone it.
Fixes #10989
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To test both expired and non-expired sstables scenarios
we need to pass this helper function the expected number
of sstables before compaction and after compaction.
When compacting a set of fully-expired sstables,
we expect none to remain, while when the set of sstables
is not fully expired, we'll expect 1 output sstable
after compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can use the lower-level building blocks to
test compaction serialization of both fully-expired
and non-fully-expired sstables scenarios in the following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The command wasn't tested fully,
and when tested, started failing scylla-gdb test.
Before Raft, the list of connections to print in a single-node setup
was always empty, so a mistake in the gdb script command 'scylla netw'
didn't lead to a test failure.
With raft, there is always an RPC connection to self after initial
bootstrap, and the test begins to print connections (and fail, because
there is a bug in the printing code).
Fix that bug.
Closes #11012
Now that we use emit_only_live_rows::no everywhere we can remove this
template parameter. Only the template parameter is removed, the
internal logic around it is left in place (will be removed in a next
patch), by hard-wiring `only_live()`.
emit_only_live_rows is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
emit_only_live_rows::no for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumers --
basically an additional if or two.
This prepares the ground for removing this template parameter and the
associated logic from the compactor.
Soon, the currently two distinct types of queriers will be merged, as
the template parameter differentiating them will be gone. This will make
using type based overload for insert() impossible, as 2 out of the 3
types will be the same. Use different names instead.
Parameterize _walk() with a method that does the actual walking. This is
a trivial change as it was already delegating all the walking logic to
_do_walk(). The latter is renamed to _walk_forward() and we add a new
method called _walk_backward() which implements walking the continuation
chain backwards (towards tasks waited on by the queried task).
The starting task is now printed at index #0, tasks waited on by the
starting task have negative indexes, tasks waiting on the starting task
have positive indexes (like before).
With this scylla fiber can be used to dump an entire fiber (barring any
difficulties detecting following more special tasks like threads).
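
The indexing scheme can be sketched in plain Python, outside gdb. The dictionaries standing in for the continuation chain and the function names are hypothetical. The real command inspects task objects in a core dump:

```python
def walk(start, step):
    """Follow step() from start until the chain ends; start itself is excluded."""
    chain = []
    task = step(start)
    while task is not None:
        chain.append(task)
        task = step(task)
    return chain


def dump_fiber(start, next_forward, next_backward):
    backward = walk(start, next_backward.get)
    forward = walk(start, next_forward.get)
    rows = []
    # The backward list is reversed so the furthest-away task is printed first.
    for i, task in enumerate(reversed(backward)):
        rows.append((i - len(backward), task))
    rows.append((0, start))
    for i, task in enumerate(forward, start=1):
        rows.append((i, task))
    return rows


# "c" is the starting task: it waits on "b", which waits on "a"; "d" waits on "c".
rows = dump_fiber("c", next_forward={"c": "d"}, next_backward={"c": "b", "b": "a"})
```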
We soon want to teach _walk() to walk in both directions. In preparation
to that, we extract all generic preparatory code that is related to the
starting task and combining arguments. This now resides in invoke(),
_walk() should only be concerned with traversing the continuation chain.
This PR introduces improvements to `expr::to_restriction` and prepares the validation part for restriction classes removal.
`expr::to_restriction` is currently used to take a restriction from the WHERE clause, prepare it, perform some validation checks and finally convert it to an instance of the restriction class.
Soon we will get rid of the restriction class.
In preparation for that `expr::to_restriction` is split into two independent parts:
* The part that prepares and validates a binary_operator
* The part that converts a binary_operator to restriction
Thanks to this split getting rid of restriction class will be painless, we will just stop using the second part.
`to_restriction.cc` is replaced by `restrictions.hh/cc`. In the future we can put all the restriction expressions code there to avoid clutter in `expression.hh/cc`.
This change made it much easier to fix #10631, so I did that as well.
Fixes: #10631
Closes #10979
* github.com:scylladb/scylla:
cql-pytest: Test that IS NOT only accepts NULL
cql-pytest: Enable testInvalidCollectionNonEQRelation
cql3: Move single element IN restrictions handling
cql3: Check for disallowed operators early
cql3: Simplify adding restrictions
cql3: Reorganize to_restriction code
cql3: Fix IS NOT NULL check in to_restriction
cql3: Swap order of arguments in error message
bytes_ostream is an incremental builder for a discontiguous byte container.
managed_bytes is a non-incremental (size must be known up front) byte
container, that is also compatible with LSA. So far, conversion between
them involves copying. This is unfortunate, since query_result is generated
as a bytes_ostream, but is later converted to managed_bytes (today, this
is done in cql3::expr::get_non_pk_values() and
compound_view_wrapper::explode()). If the two types could be made compatible,
we could use managed_bytes_view instead of creating new objects and avoid
a copy. It's also nicer to have one less vocabulary type.
This patch makes bytes_ostream use managed_bytes' internal representation
(blob_storage instead of bytes_ostream::chunk) and provides a conversion
to managed_bytes. All bytes_ostream users are left in place, but the goal
is to make bytes_ostream a write-only type with the only observer a conversion
to managed_bytes.
It turns out to be relatively simple. The internal representations were
already similar. I made blob_storage::ref_type self-initializing to
reduce churn (good practice anyway) and added a private constructor
to managed_bytes for the conversion.
Note that bytes_ostream can only be used to construct a non-LSA managed_bytes,
but LSA uses of managed_bytes are very strictly controlled (the entry
points to memtable and cache) so that's not a problem.
A unit test is added.
Closes #10986
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.
Also, run major compaction in maintenance scheduling group.
We should separate the scheduling groups used for major compaction
from the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Fixes #10961
Closes #10984
* github.com:scylladb/scylla:
compaction_manager: major_compaction_task: run in maintenance scheduling groupt
compaction_manager: allow regular compaction to run in parallel to major
The IS_NOT operator can only be used during materialized view creation
and it can only be used to express IS NOT NULL.
Trying to write something like IS NOT 42 should cause an error.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Restrictions like
col IN (1)
get converted to
col = 1
as an optimization/simplification.
This used to be done in prepare_binary_operator,
but it fits way better inside of
validate_and_prepare_new_restriction.
When it was being done in prepare_binary_operator
the conversion happened before validation checks
and the error messages would describe an equality
restriction despite the user making an IN restriction.
Now the conversion happens after all validation
is finished, which ensures that all checks are
being done on the original expression.
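The ordering described above can be sketched as follows. This is a minimal Python illustration, not the actual C++ code: `BinaryOp`, `validate`, and the rule that rejects `IN` on certain columns are hypothetical stand-ins. Only the order (validate the original expression first, simplify afterwards) comes from this commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryOp:
    """Hypothetical stand-in for a restriction such as `col IN (1)`."""
    column: str
    op: str        # "=" or "IN"
    rhs: object    # a value for "=", a tuple of values for "IN"


def validate(restriction, columns_without_in=()):
    # Hypothetical check: it sees the expression exactly as the user wrote it,
    # so the error message names the IN restriction, not an equality.
    if restriction.op == "IN" and restriction.column in columns_without_in:
        raise ValueError(
            f"IN restrictions are not supported on column {restriction.column}")


def validate_and_prepare(restriction, columns_without_in=()):
    validate(restriction, columns_without_in)
    # Only after validation: simplify a single-element IN into an equality.
    if restriction.op == "IN" and len(restriction.rhs) == 1:
        return BinaryOp(restriction.column, "=", restriction.rhs[0])
    return restriction
```

With this order, a rejected `col IN (1)` is reported as an IN restriction, while an accepted one still ends up as `col = 1`.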
Fixes: #10631
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Move checking for disallowed operators
earlier in the code flow.
This is needed to pass some tests that
expect one error message instead of the other.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The code that adds restrictions in statement_restrictions.cc
is unnecessarily convoluted.
The code to handle IS NOT NULL is actually repeated twice,
once in the constructor and once in add_is_not_restriction.
I missed this when I originally modified this code.
There is no need to keep duplicate code, we can just
use the new add_is_not_restriction.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
expr::to_restriction is currently used to
take a restriction from the WHERE clause,
prepare it, perform some validation checks
and finally convert it to an instance of
the restriction class.
Soon we will get rid of the restriction class.
In preparation for that expr::to_restriction
is split into two independent parts:
* The part that prepares and validates a binary_operator
* The part that converts a binary_operator to restriction
Thanks to this split getting rid of restriction class
will be painless, we will just stop using the
second part.
This commit splits expr::to_restriction into two functions:
* validate_and_prepare_new_restriction
* convert_to_restriction
that handle each of those parts.
All helper validation methods in the anonymous namespace
are copied from the to_restriction.cc file.
to_restriction.cc isn't the best filename for the new functionality,
so it has been renamed to restrictions.hh/cc.
In the future all the code regarding restrictions could be
put there to reduce clutter in expression.hh/cc
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
expr::to_restriction performs a check to see if
the restriction is of form: `col IS NOT NULL`
There is a mistake in this check.
It uses is<null>(prepared_binop.rhs)
to determine if the right hand side of binary operator
is a null, but the binary operator is already prepared.
During preparation expr::null is converted to expr::constant
and that wouldn't be detected by this check.
The check has been changed to check for null constant instead
of expr::null.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The error message displays two arguments in
a specific order, but the tests actually
expect them to be swapped.
Swap the arguments to match the expected
error messages in tests.
It wasn't detected earlier because the
check was never reached, but this will change
soon in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Causing the histogram to be made from the scheduling groups of the found
tasks. Allows for finding out which scheduling group dominates in-memory
tasks. This currently cannot be determined, scylla task-queues only
includes ready tasks.
Allowing to include only task objects in the histogram. Leads to
histograms with less noise but might exclude potentially important items
due to the filtering being inexact.
Both the content and the formatter method are caller-provided. Mistakes
are easy to come by. Instead of aborting the entire operation, just
print an error if an item fails to format.
Said method currently uses list() to iterate over all elements and
determine the length. Passing `self` to `list()` will however make it
call `len()` first, causing infinite recursion.
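The recursion can be reproduced outside of gdb with a minimal sketch. The classes below are hypothetical stand-ins, not the code touched by this patch. `list()` asks its argument for a length hint, which consults `__len__` when it is defined, so a `__len__` built on `list(self)` re-enters itself. Counting through the iterator avoids that.

```python
class Queue:
    """Hypothetical container wrapper; `_items` stands in for the inspected data."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        # Count by iterating directly; this never asks for len(self).
        return sum(1 for _ in self)


class BrokenQueue(Queue):
    def __len__(self):
        # list() consults __len__ as a size hint before iterating,
        # so this call re-enters __len__ until the recursion limit is hit.
        return len(list(self))


def broken_len_recurses():
    try:
        len(BrokenQueue([1, 2, 3]))
    except RecursionError:
        return True
    return False
```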
When present on the command line, the histogram is created over the
content of the queues, rather than the number of items in them.
It is possible to filter in combination with --content. In particular it
can be used to see the content of a single queue when all three of
`--to`, `--from` and `--content` are present on the command line.
This PR migrates the ScyllaDB end-user documentation from the [scylla-docs](https://github.com/scylladb/scylla-docs/) repository, according to the [migration plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit?usp=sharing). All the files are added to the `docs` subfolder.
**This PR does not cover any content changes.**
How to test this PR:
1. Go to `scylla/docs`.
2. Run `make preview`. The docs should build without any warnings.
3. Open http://127.0.0.1:5500/ in your browser. You should see the documentation landing page.
Closes #10976
* github.com:scylladb/scylla:
doc: fix errors -fix the indent in the conf.py file
doc: fix the path to Alternator
doc: fix errors - add Alternator to the toctree
doc: fix errors- update the conf.py file
doc: fix errors - remove the CNAME file
doc: add the CNAME and robots files
doc: move index and README from scylla-docs repo
doc: move the documentation from the scylla-docs repo
doc: remove the old index file
Currently scylla smp-queues has O(count(vobjects)) time complexity as it
works by scanning all objects with a vptr and searching them for a
pointer to one of the smp message queues. This is very inefficient and
unnecessary. It is much better to just look at the queues themselves and
sum up the number of items in them. This completes in 1-2 seconds on a
core where the old algorithm didn't complete in 2h+.
std_deque implementation was broken, with __len__() and __iter__()
disagreeing about the size of the container. Turns out both are wrong in
certain situations. Fix the iteration logic and re-base both __len__()
and __iter__() on the same node iteration code to prevent future
disagreements.
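The idea of basing both methods on one node walk can be sketched like this. It is a toy model, assuming a deque stored as a list of equally sized buffers with a start offset in the first node and an end offset in the last one. The class and field names are hypothetical and do not mirror the gdb script or the libstdc++ layout.

```python
class ChunkedDeque:
    """Toy stand-in for a deque stored as fixed-size nodes (hypothetical layout)."""

    def __init__(self, nodes, first_offset, last_end):
        self._nodes = nodes                # list of buffers
        self._first_offset = first_offset  # first live index in the first node
        self._last_end = last_end          # one past the last live index in the last node

    def _node_ranges(self):
        """Yield (node, begin, end) for every node that holds live elements."""
        last = len(self._nodes) - 1
        for i, node in enumerate(self._nodes):
            begin = self._first_offset if i == 0 else 0
            end = self._last_end if i == last else len(node)
            if begin < end:
                yield node, begin, end

    def __len__(self):
        # Same walk as __iter__, so the two cannot disagree.
        return sum(end - begin for _, begin, end in self._node_ranges())

    def __iter__(self):
        for node, begin, end in self._node_ranges():
            yield from node[begin:end]
```

Because the boundary handling lives only in `_node_ranges()`, a fix to it applies to the size and to the iteration at once.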
This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions:
1. `try_catch` allows inspecting the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This is based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260.
This function makes it possible to replace the current `try..catch` chains which inspect the exception type and account it in the metrics.
Example:
```c++
// Before
try {
    std::rethrow_exception(eptr);
} catch (std::runtime_error& ex) {
    // 1
} catch (...) {
    // 2
}
// After
if (auto* ex = try_catch<std::runtime_error>(eptr)) {
    // 1
} else {
    // 2
}
```
2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself.
Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write).
Results from `perf_simple_query`:
```
Before (719724e4df):
Writes:
Normal:
127841.40 tps ( 56.2 allocs/op, 13.2 tasks/op, 50042 insns/op, 0 errors)
Timeouts:
94770.81 tps ( 53.1 allocs/op, 5.1 tasks/op, 78678 insns/op, 1000000 errors)
Reads:
Normal:
138902.31 tps ( 65.1 allocs/op, 12.1 tasks/op, 43106 insns/op, 0 errors)
Timeouts:
62447.01 tps ( 49.7 allocs/op, 12.1 tasks/op, 135984 insns/op, 936846 errors)
After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f):
Writes:
Normal:
127359.12 tps ( 56.2 allocs/op, 13.2 tasks/op, 49782 insns/op, 0 errors)
Timeouts:
163068.38 tps ( 52.1 allocs/op, 5.1 tasks/op, 40615 insns/op, 1000000 errors)
Reads:
Normal:
151221.15 tps ( 65.1 allocs/op, 12.1 tasks/op, 43028 insns/op, 0 errors)
Timeouts:
192094.11 tps ( 41.2 allocs/op, 12.1 tasks/op, 33403 insns/op, 960604 errors)
```
Closes #10368
* github.com:scylladb/scylla:
database: avoid rethrows when handling exceptions from commitlog
database: convert throw_commitlog_add_error to use make_nested_exception_ptr
utils: add make_nested_exception_ptr
storage_proxy: don't rethrow when inspecting replica exceptions on write path
database: don't rethrow rate_limit_exception
storage_proxy: don't rethrow the exception in abstract_read_resolver::error
utils/exceptions.cc: don't rethrow in is_timeout_exception
utils/exceptions: add try_catch
utils: add abi/eh_ia64.hh
storage_proxy: don't rethrow exceptions from replicas when accounting read stats
message: get rid of throws in send_message{,_timeout,_abortable}
database/{query,query_mutations}: don't rethrow read semaphore exceptions
The default refresh period for permissions in both Scylla and Cassandra
is 2 seconds, which is usually perfectly fine for production
environments, but it introduces a significant delay in automatic
test cases. The refresh period is hereby set to 100ms, which allows
test_permissions.py cases to run in around 1s for Scylla instead of
tens of seconds.
Recently we noticed a regression where with certain versions of the fmt
library,
SELECT value FROM system.config WHERE name = 'experimental_features'
returns string numbers, like "5", instead of feature names like "raft".
It turns out that the fmt library keeps changing its overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have two conflicting ways to print it:
1. We have an explicit operator<<.
2. We have an *implicit* convertor to the type held by T.
We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertible to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.
The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.
I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).
Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.
This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.
Fixes #11003.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11004
Fix two cql-pytest tests that have been "XPASS"ing (unexpectedly passing)
by removing the "xfail" (expecting failure) mark from them:
One test was for an issue that has already been fixed (refs #10081).
The second test was a translated Cassandra test that should never
have failed because it doesn't trigger the issue that supposedly failed
it (that test sets a large value for a non-indexed column, so doesn't
trigger the problem we have with large values in an indexed column).
Closes #11006
When running test/cql-pytest, pytest prints one warning at the end:
/home/nyh/scylla/test/cql-pytest/test_secondary_index.py:82: DeprecationWarning: ResultSet indexing support will be removed in 4.0.
Consider using ResultSet.one() to get a single row.
assert any([index_name in event.description for event in cql.execute(query, trace=True).get_query_trace().events])
So in this patch I do exactly what the warning recommends - use one().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11002
Python has deprecated the distutils package. In several places in the
Alternator and Redis test suites, we used distutils.version to check if
the library is new enough for running the test (and skip the test if
it's too old). On new versions of Python, we started getting deprecation
warnings such as:
DeprecationWarning: The distutils package is deprecated and slated for
removal in Python 3.12. Use setuptools or check PEP 632 for potential
alternatives
PEP 632 recommends using packaging.version instead of distutils.version,
and indeed it works well. After applying this patch, Alternator and
Redis test runs no longer end in silly deprecation warnings.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11007
This new test suite is expected to gather all kinds of permissions
tests - granting, revoking, authorizing, and so on.
Right now it contains a single minimal test which ensures that
the default superuser can be granted applicable permissions,
which they already have anyway.
The test suite added in this pull request will also be useful
when developing #10633 - permissions for UDF/UDA infrastructure.
Closes #10991
* github.com:scylladb/scylla:
cql-pytest: add initial permissions test suite
cql-pytest: enable CassandraAuthorizer for Scylla and Cassandra
There was a bug which caused incorrect results of limits()
for columns with reversed clustering order.
Such columns have reversed_type as their type and this
needs to be taken into account when comparing them.
It was introduced in 6d943e6cd0.
This commit replaced uses of get_value_comparator
with type_of. The difference between them is that
get_value_comparator applied ->without_reversed()
on the result type.
Because the type was reversed, comparisons like
1 < 2 evaluated to false.
This caused the test testIndexOnKeyWithReverseClustering
to fail, but sadly it wasn't caught by CI because
the CI itself has a bug that makes it skip some tests.
The test passes now, although it has to be run manually
to check that.
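The effect of dropping `->without_reversed()` can be illustrated with a small sketch. The classes below are hypothetical Python stand-ins for the C++ types; only the behaviour described above (a reversed type flips comparisons, and the value comparator must strip the reversal) is taken from this commit message.

```python
class IntType:
    """Hypothetical stand-in for a value type with natural ordering."""

    def less(self, a, b):
        return a < b

    def without_reversed(self):
        return self


class ReversedType:
    """Hypothetical wrapper used for columns with reversed clustering order."""

    def __init__(self, underlying):
        self.underlying = underlying

    def less(self, a, b):
        # Reversed order: the arguments are swapped.
        return self.underlying.less(b, a)

    def without_reversed(self):
        return self.underlying


def value_less(column_type, a, b):
    # Compare values with the underlying type, ignoring clustering order.
    return column_type.without_reversed().less(a, b)
```

Comparing through the reversed type directly makes `1 < 2` come out false, while `value_less()` gives the expected answer for both plain and reversed columns.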
Fixes: #10918
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Closes #10994
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.
We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.
Fixes #10995
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10997
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
I reproduced bad behavior locally on my machine with a tired (high latency)
SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking
up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement).
Without load, it is 300ms vs 50ms.
Fixes #8272
Fixes #8309
Fixes #1459
Closes #10333
* github.com:scylladb/scylla:
config: Introduce force_schema_commit_log option
config: Introduce unsafe_ignore_truncation_record
db: Avoid memtable flush latency on schema merge
db: Allow splitting initialization of system tables
db: Flush system.scylla_local on change
migration_manager: Do not drop system.IndexInfo on keyspace drop
Introduce SCHEMA_COMMITLOG cluster feature
frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations
db/commitlog: Improve error messages in case of unknown column mapping
db/commitlog: Fix error format string to print the version
db: Introduce multi-table atomic apply()
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.
In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10972
This new test suite is expected to gather all kinds of permissions
tests - granting, revoking, authorizing, and so on.
Right now it contains a single minimal test which ensures that
the default superuser can be granted applicable permissions,
which they already have anyway.
In order to be able to test permissions, an authorizer different
than AllowAllAuthorizer (default) must be set.
CassandraAuthorizer is thus enabled - it works on default user/password
pair, so it doesn't introduce any regressions to the test suite.
"
The option controls the IO bandwidth of the compaction sched class.
It's now set to be 16MB/s, but is unused. This set makes it 0 by
default (which means unlimited), live-updateable and plugs it to the
seastar sched group IO throttling.
branch: https://github.com/xemul/scylla/tree/br-compaction-throttling-3
tests: unit(dev),
v2: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1010/ ,
v2: manual config update
"
* 'br-compaction-throttling-3-a' of https://github.com/xemul/scylla:
compaction_manager: Add compaction throughput limit
updateable_value: Support dummy observing
serialized_action: Allow being observer for updateable_value
config: Tune the config option
The node now refuses to boot if schema tables were truncated.
This adds a config option to ignore truncation records as a
workaround if user truncated them manually.
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
One complication with this change is that replay_position markers are
commitlog-domain specific and cannot cross domains. They are recorded
in various places which survive node restart: sstables are annotated
with the maximum replay position, and they are present inside
truncation records. The former annotation is used by "truncate"
operation to drop sstables. To prevent old replay positions from being
interpreted in the context of the new schema commitlog domain, the
change refuses to boot if there are truncation records, and also
prohibits truncation of schema tables.
The boot sequence needs to know whether the cluster feature associated
with this change was enabled on all nodes. Features are stored in
system.scylla_local. Because we need to read it before initializing
schema tables, the initialization of tables now has to be split into
two phases. The first phase initializes all system tables except
schema tables, and later we initialize schema tables, after reading
stored cluster features.
The commitlog domain is switched only when all nodes are upgraded, and
only after new node is restarted. This is so that we don't have to add
risky code to deal with hot-switching of the commitlog domain. Cold
switching is safer. This means that after upgrade there is a need for
yet another rolling restart round.
Fixes #8272
Fixes #8309
Fixes #1459
It's not needed anymore because system.IndexInfo is a virtual table
calculated from view info.
The drop accesses a table which is outside system_schema keyspace
so crosses commit log domain. This will trigger an internal error from
database::apply() on schema merge once the code switches to use
the schema commit log and require that all mutations which are
part of the schema change belong to a single commit log domain.
We could theoretically move system.IndexInfo to the schema commit log
domain. It's not easy though because table initialization at boot
needs to be split, and current functions for initialization work
at keyspace granularity, not table granularity.
We should separate the scheduling groups used for major compaction
from the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A leader which ceases to be a leader as a result of
execute_modify_config cannot wait for a dummy record to be
committed because io_fiber aborts current waiters as soon as it
detects a loss of leadership.
This commit excludes dummy entries from the configuration change
procedure. A special promise is set on io_fiber when it gets a
non-joint configuration, and set_configuration just waits for
the corresponding future instead of a dummy record.
Fixes: #10010
Closes #10905
This reverts commit aa8f135f64, reversing
changes made to 9a88bc260c. The patch
causes hangs during flush.
Also reverts parts of 411231da75 that impacted the unit test.
Fixes #10897.
In a large cluster, a node would receive frequent and periodic gossip
application state updates like CACHE_HITRATES or VIEW_BACKLOG from peer
nodes. Those states are not critical. They should not be counted for the
_msg_processing counter which is used to decide if gossip is settled.
This patch fixes the long settle on every restart issue reported by
users.
Refs #10337
Closes #10892
Re-use existing compaction_throughput_mb_per_sec option, push it down to
compaction manager via config and update the underlying compaction sched
class when the option is (live)updated.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
An updateable_value() may come without a source attached. One way this
can happen is if the value sits on a service config.
It's a good option to make the config have some default initialization
for the option, but in this case observe()ing an option by the service
would dereference a null pointer.
That said, if a value without a source is observed, assume
that it's OK, but the value will never change, so a dummy observer is
provided.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Live-updating an option may involve running some action when the option
changes, not just getting its new value into somewhere. It is convenient
to run the action as a serialized action to batch config updates.
That said, here's some sugar to write
    serialized_action _foo = [this] { return foo(); };
    observer<> _o = option.observe(_foo.make_observer());
instead of
    serialized_action _foo = [this] { return foo(); };
    observer<> _o = option.observe([this] {
        // waited with .join on stop
        (void)_foo.trigger();
    });
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Native types were parsed directly to data_type, where varchar and text were
parsed to utf8_type. To get the name of the type there was a call to
the data_type method, so getting the name of the varchar type returned "text".
To fix this, a new nonterminal, type_unreserved_keyword, was added, which parses native
types to their names. It replaced native_or_internal_type in unreserved_function_keyword.
unreserved_function_keyword is also used to parse usernames, keyspace names, index names,
column identifiers, service levels and role names, so this bug was fixed in them as well.
Fixes: #10642
Closes #10960
Now, throw_commitlog_add_error is renamed to throw_commitlog_add_error.
Instead of wrapping the currently executing exception and rethrowing it,
it takes an std::exception_ptr, wraps it and also returns
std::exception_ptr.
The utils::make_nested_exception_ptr function works similarly to
std::throw_with_nested, but instead of storing the currently thrown
exception as the nested exception and then immediately throwing the new
exception, it receives the nested exception as an std::exception_ptr and
also returns an std::exception_ptr.
If the standard library supports it, the function does not perform any
throws. Otherwise the fallback logic performs two throws.
Introduces a utility function which allows obtaining a pointer to the
exception data held behind an std::exception_ptr if the data matches the
requested type. It can be used to implement manual but concise
try..catch chains.
The `try_catch` has the best performance when used with libstdc++ as it
uses the stdlib specific functions for simulating a try..catch without
having to actually throw. For other stdlibs, the implementation falls
back to a throw surrounded by an actual try..catch.
Recently a change to Scylla's expression implementation changed the standard
error message copied from Cassandra:
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING
In the special case where the filter is on the partition key, we changed
the message to:
Only EQ and IN relation are supported on the partition key (unless you
use the token() function or allow filtering)
We had a cql-pytest test translated from Cassandra's unit test that checked
the old message, and started to fail. Unfortunately nobody noticed because
a bug in test.py caused it to stop running these translated unit tests.
So in this patch, we trivially fix the test to pass again. Instead of
insisting on the old message, we check just for the string "allow
filtering", in lowercase or uppercase. After this patch, the test
passes as expected on both Scylla and Cassandra.
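A case-insensitive check of this kind could look roughly like the sketch below. The helper name is hypothetical and the real test may be written differently. The two messages quoted above are used as inputs.

```python
import re


def mentions_allow_filtering(message):
    """True if the error message mentions "allow filtering" in any letter case."""
    return re.search(r"allow filtering", message, re.IGNORECASE) is not None


CASSANDRA_MESSAGE = (
    "Cannot execute this query as it might involve data filtering and thus "
    "may have unpredictable performance. If you want to execute this query "
    "despite the performance unpredictability, use ALLOW FILTERING")

SCYLLA_MESSAGE = (
    "Only EQ and IN relation are supported on the partition key (unless you "
    "use the token() function or allow filtering)")
```

Both messages satisfy the check, which is what lets the same test pass on Scylla and Cassandra.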
Refs #10918 (this test failing is one of the failures reported there)
Refs #10962 (test.py stopped running this test)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10964
The current debug log is a bit difficult to collect in CI: to find the debug log
we must know which script caused the exception,
because the filename does not include a prefix and the specified
directory is shared with other programs.
To make things easier, let's change the debug log directory to /var/tmp/scylla.
Closes #10730
This mini-series adds an _async_gate to storage_service that is closed on stop()
and it performs restore_replica_count under this gate so it can be orderly waited on in stop()
Fixes #10672
Closes #10922
* github.com:scylladb/scylla:
storage_service: handle_state_removing: restore_replica_count under _async_gate
storage_service: add async_gate for background work
A recent change added `--security-opt label:disable` to the docker
options. There are examples of this syntax on the web, but podman
and docker manuals don't mention it and it doesn't work on my machine.
Fix it into `--security-opt label=disable`, as described by the manuals.
Closes #10965
Adds a header for utility functions/structures, based on the Itanium ABI
for C++, necessary for us to inspect exceptions behind
std::exception_ptr without having to actually rethrow the exception.
Adds measuring the apparent delta vector of footprint added/removed within
the timer time slice, and potentially include this (if influx is greater
than data removed) in threshold calculation. The idea is to anticipate
crossing usage threshold within a time slice, so request a flush slightly
earlier, hoping this will give all involved more time to do their disk
work.
Obviously, this is very akin to just adjusting the threshold downwards,
but the slight difference is that we take actual transaction rate vs.
segment free rate into account, not just static footprint.
Note: this is a very simplistic version of this anticipation scheme,
we just use the "raw" delta for the timer slice.
A more sophisticated approach would perhaps do either a lowpass
filtered rate (adjust over longer time), or a regression or whatnot.
But again, the default period of 10s is something of an eternity,
so maybe that is superfluous...
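As a rough sketch of the anticipation scheme: the function and parameter names below are illustrative and the actual commitlog code may combine the values differently. The raw per-slice delta is added to the current footprint only when more bytes came in than were released.

```python
def should_request_flush(footprint, added, released, threshold):
    """Decide whether to request a flush; all arguments are byte counts.

    `added` and `released` are the amounts observed within one timer slice.
    """
    net_influx = added - released
    # Include the delta only when influx is greater than the data removed.
    projected = footprint + net_influx if net_influx > 0 else footprint
    return projected >= threshold
```

For example, with a threshold of 100, a footprint of 90 with 20 added and 5 released already triggers a flush request, while the same footprint with 5 added and 20 released does not.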
Closes #10651
* github.com:scylladb/scylla:
commitlog: Add (internal) measurement of byte rates add/release/flush-req
commitlog: Add counters for # bytes released/flush requested
commitlog: Keep track of last flush high position to avoid double request
commitlog: Fix counter descriptor language
"
On stop there's a rather long log-less gap in the middle of
storage_service::drain_on_shutdown(). This set adds log in
interesting places and while at it tosses the patched code.
refs: #10941
"
* 'br-shutdown-logging' of https://github.com/xemul/scylla:
batchlog_manager: Add drain and stop logging
batchlog_manager: Coroutinize drain and stop
batchlog_manager: Drain it with shared future
commitlog: Add shutdown message
database: Move flushing logging
compaction_manager: Add logging around drain
compaction_manager: Coroutinize drain
storage_service: Sanitize stop_transport()
Google group started replacing sender email with the group email
recently. Here's the list of spoiled entries combined from seastar
and scylla repos
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220701160252.11967-1-xemul@scylladb.com>
This is not an identical change: if drain() resolves with an exception we end
up skipping the gate closing, but since it's stop, why bother
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The .drain() method can be called from several places, each needs to
wait for its completion. Now this is achieved with the help of a gate,
but there's a simpler way
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It happens in database::drain(), we know when it starts after keyspaces
are flushed, now it's good to know when it completes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it happens before calling database::drain(), but drain is not only
flushing, it does lots of other things. More elaborate logging is better
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
with several Java versions
The test/cql-pytest/run-cassandra script runs our cql-pytest tests against
Cassandra. Today, Cassandra can only run correctly on Java 8 or 11
(see https://issues.apache.org/jira/browse/CASSANDRA-16895) but recent
Linux distributions have switched to newer versions of Java - e.g., on
my Fedora 36 installation, the default "java" is Java 17, which can't
run Cassandra.
So this patch checks whether "java" has the right version, and if it
doesn't, looks in several additional locations for a Java of the right
version. By the way, we are sure that Java 8 must be installed because
our install-dependencies.sh installs it.
After this patch, test/cql-pytest/run-cassandra resumes working on
Fedora 36.
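The version check could look roughly like the sketch below. This is not the
code from the patch: the function names, the candidate paths and the injected
"probe" callable (which stands in for running "<path> -version") are all
assumptions made for illustration.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` style output."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if m is None:
        return None
    major = int(m.group(1))
    # Java 8 and earlier report themselves as "1.8.0_...", i.e. major "1".
    if major == 1 and m.group(2) is not None:
        return int(m.group(2))
    return major


def pick_java(candidates, probe, supported=(8, 11)):
    """Return the first candidate whose major version is supported, else None.

    `probe` is a hypothetical callable returning the version output for a
    given path, or None if that path cannot be executed.
    """
    for path in candidates:
        output = probe(path)
        if output is not None and java_major_version(output) in supported:
            return path
    return None
```

Injecting the probe keeps the selection logic testable without any Java
installed.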
Fixes #10946
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10947
The method responsible for creating a token from given values is not meant
to be used with empty optionals. Thus, requesting a token of columns
containing null values resulted in an exception being thrown. This
behaviour was not compatible with Cassandra's.
To fix this, the values are checked for nulls before the token is computed.
If any value in the processed vector is null, a null value is returned.
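A minimal sketch of the null check, written in Python for brevity (the
function name and the compute_token callable are hypothetical; the actual
fix is in the C++ token function):

```python
from typing import Callable, Optional, Sequence


def token_or_null(values: Sequence[Optional[bytes]],
                  compute_token: Callable[[Sequence[bytes]], int]) -> Optional[int]:
    """Return None if any component is null, otherwise the computed token."""
    if any(v is None for v in values):
        # Same result Cassandra gives: token of a null component is null.
        return None
    return compute_token(values)
```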
Fixes: #10594
Closes #10942
If a single-patch pull request fails cherry-picking, it's still possible
to recover it (if it's a simple conflict). Give the maintainer the option
by opening a subshell and instructing them to either complete the cherry-pick
or abort it.
Closes #10949
* seastar 9c016aeebf...7d8d846b26 (16):
> Merge 'coroutine: exception: retain exception_ptr type' from Benny Halevy
> core: log in on_internal_error even when throwing
> sched_group: Report the sched group that exceeded the limit
Fixes #8226.
> Add .mailmap
> prometheus: make the help string optional
> core: lw_shared_ptr: allow defining `lw_shared_ptr<T>` class member without knowing the definition of `T`
> ci: build and test in debug and dev modes
> Merge 'Added summaries, remove empty, and aggregation to Prometheus' from Amnon Heiman
> Merge 'net/tls: vec_push: call on_internal_error if _output_pending already failed' from Benny Halevy
Fixes #10127
> Merge 'CI: build and test with both gcc and clang ' from Beni Peled
> Merge "Initialize lowres_clock::_now earlier" from Pavel E
Ref #10743
> reactor: don't count make_exception_future etc. in cpp_exceptions metric
> file: Deprecate file lifetime hint calls
> foreign_ptr: fix doc.
> cmake: fix mention of FindLibUring.cmake in install target
> semaphore: derive named_semaphore_aborted exception from semaphore_aborted
Fixes #10666.
Closes #10951
In order to allow our Scylla OSS customers the ability to select a version for their documentation, we are migrating the Scylla docs content to the Scylla OSS repository. This PR covers the following points of the [Migration Plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit#):
1. Creates a subdirectory for dev docs: /docs/dev
2. Moves the existing dev doc content in the scylla repo to /docs/dev, but keep Alternator docs in /docs.
3. Flattens the structure in /docs/dev (remove the subfolders).
4. Adds redirects from `scylla.docs.scylladb.com/<version>/<document>` to `https://github.com/scylladb/scylla/blob/master/docs/dev/<document>.md`
5. Excludes publishing docs for /docs/dev.
1. Enter the docs folder with `cd docs`.
2. Run `make redirects`.
3. Enter the docs folder and run `make preview`. The docs should build without warnings.
4. Open http://127.0.0.1:5500 in your browser. You should only see the alternator docs.
5. Open http://127.0.0.1:5500/stable/design-notes/IDL.html in your browser. It should redirect you to https://github.com/scylladb/scylla/blob/master/docs/dev/IDL.md and raise a 404 error since this PR is not merged yet.
6. Surf the `docs/dev` folder. It should have all the scylla project internal docs without subdirectories.
Closes #10873
* github.com:scylladb/scylla:
Update docs/conf.py
Update docs/dev/protocols.md
Update docs/dev/README.md
Update docs/dev/README.md
Update docs/conf.py
Fix broken links
Remove source folder
Add redirections
Move dev docs to docs/dev
After compiling to WASM, UDFs become much larger than the
source code. When they're included in test_wasm.py, it
becomes difficult to navigate in the file. Moving them
to another place does not make understanding the test
scripts harder, because the source code is still included.
This problem will become even more severe when testing
UDFs using WASI.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #10934
By default, Docker uses SELinux to prevent malicious code in the container
from "escaping" and touching files outside the container: The container
is only allowed to touch files with a special SELinux label, which the
outside files simply do not have. However, this means that if you want
to "mount" outside files into the container, Docker needs to add the
special label to them. This is why one needs to use the ":z" option
when mounting an outside file inside docker - it asks docker to "relabel"
the directory to be usable in Docker.
But this relabeling process is slow and potentially harmful if done to
large directories such as your home directory, where you may theoretically
have SELinux labels for other reasons. The relabeling is also unnecessary -
we don't really need the SELinux protection in dbuild. Dbuild was meant
to provide a common toolchain - it was never meant to protect the build
host from a malicious build script.
The alternative we use in this patch is "--security-opt label=disable".
This allows the container to access any file in the host filesystem,
but as usual - only if it's explicitly "mounted" into the container.
All ":z" we added in the past can be removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10945
This PR removes some restriction classes and replaces them with expressions.
* `single_column_restriction` has been removed altogether.
* `partition_key_restrictions` field inside `statement_restrictions` has been replaced with `expression`
`clustering_key_restrictions` are not replaced yet, but this PR already has 30 commits so it's probably better to merge this before adding any more changes.
Luckily most of these commits are implementations of small helper functions.
`single_column_restriction` was pretty easy to remove. This class holds the `expression` that describes the restriction and `column_definition` of the restricted column.
It inherits from `restriction` - the base class of all restrictions.
I wasn't able to replace it with plain `expression` just yet, because a lot of times a `shared_ptr<single_column_restriction>` is being cast to `shared_ptr<restriction>`.
Instead I replaced all instances of `single_column_restriction` with `restriction`.
To decide if a `restriction` is a `single_column_restriction` we can use a helper method that works on expressions.
Same with acquiring the restricted `column_definition`.
This change has two advantages:
* One less restriction class -> moving towards 0
* Preparing towards one generic `restriction/expression` type and using functions to distinguish the type of expression that we're dealing with.
`partition_key_restrictions` is a class used to keep restrictions on the partition key inside `statement_restrictions`.
Removing it required two major steps.
First I had to implement taking all the binary operators and making sure that they are valid together.
Before the change this was the `merge_to` method. It ensures that for example there are no token and regular restrictions occurring at the same time.
This has been implemented as `statement_restrictions::add_restriction`.
It detects which case it's dealing with and mimics `merge_to` from the right restrictions class.
Then I implemented all methods of `partition_key_restrictions` but operating on plain `expressions`.
While doing that I was able to gradually shift the responsibility to the brand new functions.
Finally `partition_key_restrictions` wasn't used anywhere at all and I was able to remove it.
Here's the inheritance tree of all restriction classes for context:

For now this is marked as a draft.
I just put all this together in a readable way and wanted to put it out for you to see.
I will have another look at the code and maybe do some improvements.
Closes #10910
* github.com:scylladb/scylla:
cql3: Remove _new from _new_partition_key_restrictions
cql3: Remove _partition_key_restrictions from statement_restrictions
cql3: Use expression for index restrictions
cql3: expr: Add contains_multi_column_restriction
cql3: Add expr::value_for
cql3: Use the new restrictions map in another place
cql3: use the new map in get_single_column_partition_key_restrictions
cql3: Keep single column restrictions map inside statement restrictions
cql3: Use expression instead of _partition_key_restrictions in the remaining code
cql3: Replace partition_key_restrictions->has_supporting_index()
cql3: Replace statement_restrictions->get_column_defs()
cql3: Replace partition_key_restrictions->needs_filtering()
cql3: Replace partition_key_restrictions->size()
cql3: Replace partition_key_restrictions->is_all_eq()
cql3: Replace partition_key_restrictions->has_unrestricted_components()
cql3: Replace partition_key_restrictions->empty()
cql3: Keep restrictions as expressions inside statement_restrictions
cql3: Handle single value INs inside prepare_binary_operator
cql3: Add get_columns_in_commons
cql3: expr: Add is_empty_restriction
cql3: Replicate column sorting functionality using expressions
cql3: Remove single_column_restriction class
cql3: Replace uses of single_column_restriction with restriction
cql3: expr: Add get_the_only_column
cql3: expr: Add is_single_column_restriction
cql3: expr: Add for_each_expression
cql3: Remove some unused methods
_new_partition_key_restrictions was a temporary name
used during the transition from restrictions to expressions.
Now that restrictions aren't used anymore it can be changed
back to _partition_key_restrictions.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Now that all functionality of partition_key_restrictions
has been implemented using expressions we can remove
this field from statement_restrictions.
_new_partition_key_restrictions will be used for
everything instead.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Restrictions that might be used by an index
are currently being kept as shared_ptr<restriction>.
This stands in the way of replacing _partition_key_restrictions
with an expression as an expression can't be cast to
shared_ptr<restriction>.
Change shared_ptr<restriction> to expression everywhere
where necessary in index operations.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
value_for is a method from the restriction class
which finds the value for a given column.
Under the hood it makes use of possible_lhs_values.
It will be needed to implement some functionality
that was implemented using restrictions before.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Some parts of the code make use of a map keeping single column restrictions
for each partition key column. One of these places is inside do_filter,
so it could be a performance problem to create such a map from scratch
each time.
After adding all restrictions from the where clause the new
map is created and can be used for various purposes.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There are still some places that use partition_key_restrictions
instead of _new_partition_key_restrictions in statement_restrictions.
Change them to use the new representation.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To remove partition_key_restrictions all of its
methods have to be implemented using the new expression
representation.
The first to go is empty() as it's easy to implement.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently restrictions on partition, clustering and nonprimary columns
are kept inside special purpose restriction objects.
We want to remove all the restrictions classes so these objects
will be removed as well.
In the future each of these restrictions will be kept in
an expression.
Add new fields to statement_restrictions class which
will keep the right restrictions.
Currently restrictions from where clause are
added one by one using merge_to method of
the restrictions class.
This functionality will be replaced by statement_restrictions::add_restriction.
Functions for adding restrictions perform validation and
add new restrictions to the right field inside the class.
The checks that are done in add_*_restriction methods
correspond to the checks performed by merge_to
in respective restriction classes.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently expr::to_restriction is the only place where
prepare_binary_operator is called.
In case of a single-value IN restriction like:
mycol IN (1)
this expression is converted to
mycol = 1
by expr::to_restriction.
Once restriction is removed expr::to_restriction
will be removed as well so its functionality has to
be moved somewhere else.
Move handling single value INs inside
prepare_binary_operator.
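The rewrite itself is simple. A Python sketch of the idea follows; the
BinaryOperator class here is a hypothetical stand-in, not the cql3
expression type:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class BinaryOperator:
    """Hypothetical, simplified stand-in for a binary operator expression."""
    lhs: str
    op: str
    rhs: Any


def simplify_single_value_in(expr: BinaryOperator) -> BinaryOperator:
    """Turn `col IN (x)` into `col = x`; leave everything else untouched."""
    if expr.op == "IN" and isinstance(expr.rhs, (list, tuple)) and len(expr.rhs) == 1:
        return BinaryOperator(expr.lhs, "=", expr.rhs[0])
    return expr
```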
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that finds common columns
between two expressions.
It's used in error messages in the original
restrictions code so it must be included
in the new code as well for compatibility.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Restrictions code keeps restrictions for each column
in a map sorted by their position in the schema.
Then there are methods that allow to access
the restricted column in the correct order.
To replicate this in upcoming code
we need functions that implement this functionality.
The original comparator can be found in:
cql3/restrictions/single_column_restrictions.hh
For primary key columns this comparator compares their
positions in the schema.
For non-primary columns the position is assumed to
be clustering_key_size(), which seems pretty random.
To avoid passing the schema to the comparator
for nonprimary columns I just assume the
position is u32::max(). This seems to be
as good of a choice as clustering_key_size().
Originally Cassandra used -1:
bc8a260471/src/java/org/apache/cassandra/config/ColumnDefinition.java (L79-L86)
We never end up comparing columns of different kind using this comparator anyway.
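The ordering described above can be sketched as a sort key. This is a Python
illustration with a hypothetical Column type, not the C++ comparator added
by the patch:

```python
from dataclasses import dataclass

U32_MAX = 2**32 - 1


@dataclass(frozen=True)
class Column:
    """Hypothetical, simplified column definition."""
    name: str
    is_primary_key: bool
    position: int = 0


def schema_position(col: Column) -> int:
    # Primary key columns sort by their position in the schema; every
    # non-primary column gets the same sentinel position, so no schema
    # lookup is needed for them.
    return col.position if col.is_primary_key else U32_MAX


def sort_by_schema_position(columns):
    return sorted(columns, key=schema_position)
```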
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Now that all uses of this class have been
replaced by the generic restriction
the class is not used anywhere and can be removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
single_column_restriction is a class used to represent
restrictions in a single column.
The class is very simple - it's basically an expression
with some additional information.
As a step towards removing all restriction classes
all uses of this class are replaced by uses of
the generic restriction class.
All functionality of this class has been implemented
using free standing functions operating on expressions.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that gets the only column
from a single column restriction expression.
The code would be very similar to
is_single_column_restriction, so a new
function is introduced to reduce duplication.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that checks whether an expression
contains restrictions on exactly one column.
This is a "single_column_restriction"
in the same way that instances of
"class single_column_restriction" are.
It will be used to distinguish such cases
once this class is removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
To call a UDF that is using WASI, we need to properly
configure the wasmtime instance that it will be called
on. The configuration was missing from udf_cache::load(),
so we add it here.
The free function does not return any value, so we should use
a calling method that does not expect any returns.
This patch adds such a method and uses it.
A test that did not pass without this fix and does pass after
is added.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #10935
cookie only when reading CQL tables' from Botond Dénes
Recently, we added full position-in-partition support to alternator's
paging cookie, so it can support stopping at arbitrary positions. This
support however is only really needed when tables have range tombstones
and alternator tables never have them. So to avoid having to make the
new fields in 'ExclusiveStartKey' reserved, we avoid filling these in
when reading an alternator table, as in this case it is safe to assume
the position is `after_key($clustering_key)`. We do include these new
members however when reading CQL tables through alternator. As this is
only supported for system tables, we can also be sure that the elaborate
names we used for these fields are enough to avoid naming clashes.
Fixes: https://github.com/scylladb/scylla/issues/10903
Closes #10920
* github.com:scylladb/scylla:
alternator: use position-in-partition in paging cookie only when reading CQL tables
alternator: make is_alternator_keyspace() a standalone method
test/scylla-gdb tests Scylla's gdb debugging tools, and cannot work if
Scylla was compiled without debug information (i.e, the "dev" build mode).
In the past, test/scylla-gdb/run detected this case and printed a clear error:
Scylla executable was compiled without debugging information (-g)
so cannot be used to test gdb. Please set SCYLLA environment variable.
Unfortunately, since recently this detection fails, because even when
Scylla is compiled without debug information we link into it a library
(libwasmtime.a) which has *some* debug information. As a result, instead
of one clear error message, we get all scylla-gdb tests running -
and each of them failing separately. This is ugly and unhelpful.
Each of the tests fail because our "gdb" test fixture tries to load
scylla-gdb.py and fails when the symbols it needs (e.g., "size_t")
cannot be found. So in this patch, we check once for the existence
of this symbol - and if missing we exit pytest instead of failing each
individual test.
Moreover, if loading scylla-gdb.py fails for some other unexpected
reason, let's exit the test as well, instead of failing each individual
test.
Fixes #10863.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10937
Closes #10930
* github.com:scylladb/scylla:
test: perf_row_cache_update: Flush std output after each line
test: perf_row_cache_update: Drain background cleaner before starting the test
test: perf_row_cache_update: Measure memtable filling time
test: perf_row_cache_update: Respect preemption when applying mutations
test: perf_row_cache_update: Drop unused pk variable
Recently, we added full position-in-partition support to alternator's
paging cookie, so it can support stopping at arbitrary positions. This
support however is only really needed when tables have range tombstones
and alternator tables never have them. So to avoid having to make the
new fields in 'ExclusiveStartKey' reserved, we avoid filling these in
when reading an alternator table, as in this case it is safe to assume
the position is `after_key($clustering_key)`. We do include these new
members however when reading CQL tables through alternator. As this is
only supported for system tables, we can also be sure that the elaborate
names we used for these fields are enough to avoid naming clashes.
The condition in the code implementing this is actually even more
general: it only includes the region/weight members when the position
differs from that of a normal alternator one.
for_each_expression is a function that
can be used to iterate over all expressions
inside an expression recursively and perform
some operation on each of them.
For example:
for_each_expression<column_value>(e, [](const column_value& cval) {std::cout << cval << '\n';});
will print all column values in an expression.
It's awkward to do this using recurse_until or find_in_expression
because these functions are meant for slightly different purposes.
Having a dedicated function for this purpose will make the code
cleaner and easier to understand.
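The traversal pattern, sketched in Python with hypothetical node classes
(the real function is a C++ template over the cql3 expression variant, so
this only mirrors the shape of the recursion):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnValue:
    name: str


@dataclass(frozen=True)
class Constant:
    value: int


@dataclass(frozen=True)
class BinaryOperator:
    lhs: object
    op: str
    rhs: object


@dataclass(frozen=True)
class Conjunction:
    children: tuple


def for_each_expression(expr, node_type, visit):
    """Call visit() on every node of node_type inside expr, recursively."""
    if isinstance(expr, node_type):
        visit(expr)
    if isinstance(expr, BinaryOperator):
        for_each_expression(expr.lhs, node_type, visit)
        for_each_expression(expr.rhs, node_type, visit)
    elif isinstance(expr, Conjunction):
        for child in expr.children:
            for_each_expression(child, node_type, visit)


def column_names(expr):
    """Collect the names of all column values in expr, in visiting order."""
    names = []
    for_each_expression(expr, ColumnValue, lambda cv: names.append(cv.name))
    return names
```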
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The option is used, but is not implemented. If an implementation were
attached to it right away, compaction would slow down to 16MB/s on all nodes.
Make it zero (unbounded) by default and mark it live-updateable while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before this patch, the test cql-pytest/test_tools.py left behind
a temporary file in /tmp. It used pytest's "tmp_path_factory" feature,
but it doesn't remove temporary files it creates.
This patch removes the temporary file when the fixture using it ends,
but moreover, it puts the temporary file not in /tmp but rather next
to Scylla's data directory. That directory will be eventually removed
entirely, so even if we accidentally leave a file there, it will
eventually be deleted.
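The cleanup pattern can be sketched as a context manager. This is not the
fixture from the patch: the helper name is hypothetical, and placing the
file in the parent of the data directory is only one reading of "next to".

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_file_next_to(data_dir, suffix=".json"):
    """Create a temporary file beside data_dir and remove it on exit."""
    parent = os.path.dirname(os.path.abspath(data_dir))
    fd, path = tempfile.mkstemp(suffix=suffix, dir=parent)
    os.close(fd)
    try:
        yield path
    finally:
        # Runs even if the body raises, so no file is left behind.
        os.remove(path)
```

A pytest fixture can get the same effect by yielding the path and removing
the file after the yield.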
Fixes #10924
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10929
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip a range tombstone for a certain pattern of deletions
and under certain sequence of events.
_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.
The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.
For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:
[a, c] @ t0
[a, b) @ t1, [b, d] @ t2
c > b
The proper sequence for such versions is (assuming d > c):
[a, b) @ t1,
[b, d] @ t2
Due to the bug, the reader will produce:
[a, b) @ t1,
[b, c] @ t0
The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).
The problem goes away once MVCC versions merge.
Fixes #10913
Fixes #10830
Closes #10914
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.
They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.
Closes #10864
* github.com:scylladb/scylla:
service/raft: raft_group_registry: add assertions when fetching servers for groups
service/raft: raft_group_registry: remove `_raft_support_listener`
service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
service/raft: messaging: extract raft_addr/inet_addr conversion functions
service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
test/boost: memtable_test: perform schema operations on shard 0
test/boost: cdc_test: remove test_cdc_across_shards
message: rename `send_message_abortable` to `send_message_cancellable`
message: change parameter order in `send_message_oneway_timeout`
There effectively are several test-cases in this test, each calls the
scylla_sstable() to prepare, thus each creates a type in the same scylla
instance. The 2nd attempt ends up with the "already exists" error:
E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="A user type of name cql_test_1656396925652.type1 already exists"
tests: unit(dev)
https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1075/
fixes: #10872
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220628081459.12791-1-xemul@scylladb.com>
A number of improvements in test.py as requested by maintainers:
* don't capture pytest output
* stick to the specific server in control connections
* support --log-level option and pass it to logging module
* when checking if CQL is up, ignore timeout errors
* no longer force schema migration when starting the server
* use test uname, not id, in log output
* improve logging of ScyllaServer
* log what cluster is used for a test
* extend xml output with logs
By the same token, remove mypy warnings and make the linter pass on test.py, and add some type checking.
Fixes #10871
Fixes #10785
Closes #10902
* github.com:scylladb/scylla:
test.py: extend xml output with logs
test.py: log what cluster is used for a test
test.py: improve logging of ScyllaServer
test.py: use test uname, not id, in log output
test.py: support --log-level option and pass it to logging module
test.py: make ScyllaServer more reliable and fast
test.py: don't capture pytest output
test.py: add type annotations
test.py: convert log_filename to pathlib
test.py: please linter
test.py: remove mypy warnings
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache (e.g. Adding a user and changing their permissions)
can become pretty annoying.
This patch series makes permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.
It also adds an API for flushing the cache to make it easier for users who
don't want to modify their permissions_cache config.
branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable
CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/
dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache
* https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable:
loading_cache_test: Test loading_cache::reset and loading_cache::update_config
api: Add API for resetting authorization cache
authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
auth_prep_statements_cache: Make auth_prep_statements_cache accept a config struct
utils/loading_cache.hh: Add update_config method
utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
utils/loading_cache.hh: Add reset method
Track the background restore_replica_count fiber
so it can be awaited in stop() by closing the
_async_gate.
Fixes #10672
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Validate that the size of the cache is zero after calling the
reset method and that the config is being updated correctly
after calling update_config.
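What the test checks can be sketched against a toy cache. This is a Python
illustration only: the class and field names are hypothetical, and expiry
and refresh are deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass
class LoadingCacheConfig:
    """Hypothetical config: size limit plus expiry/refresh periods in ms."""
    max_size: int
    expiry_ms: int
    refresh_ms: int


class LoadingCache:
    """Toy cache that loads values on first access."""

    def __init__(self, config, loader):
        self._config = config
        self._loader = loader
        self._entries = {}

    @property
    def config(self):
        return self._config

    def get(self, key):
        if key not in self._entries:
            self._entries[key] = self._loader(key)
        return self._entries[key]

    def size(self):
        return len(self._entries)

    def reset(self):
        # Drop every cached entry; the next get() reloads from the loader.
        self._entries.clear()

    def update_config(self, config):
        # Swap the configuration in place, without recreating the cache.
        self._config = config
```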
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
For cases where we have very high values set to permissions_cache validity and
update interval (E.g.: 1 day), whenever a change to permissions is made it's
necessary to update scylla config and decrease these values, since waiting for
all this time to pass wouldn't be viable.
This patch adds an API for resetting the authorization cache so that changing
the config won't be mandatory for these cases.
Usage:
$ curl -X POST http://localhost:10000/authorization_cache/reset
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache(e.g.: Adding a user) can become pretty annoying.
This patch makes permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch makes authorized_prepared_statements_cache accept a config struct,
similarly to permissions_cache. This will make it easier to make this cache
live updateable on the next patch.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch adds an update_config method in order to allow live updating the
config for permissions_cache. This method is going to be used in the next
patches after making permissions_cache config live updateable.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch renames the permissions_cache_config struct to loading_cache_config
and moves it to utils/loading_cache.hh. This will make it easier to handle
config updates to the authorization caches on the next patches.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
Change tests to use async mode and add helpers and tests for schema changes.
These test series will be expanded with topology changes.
Closes #10550
* github.com:scylladb/scylla:
test.py topology: repro for issue #1207
test.py: port fixture fails_without_raft
test.py topology: table methods to add/remove index
test.py topology: add/drop table column helpers
test.py topology: insert sequential row
test.py: remove deprecated test test_null
test.py: managed random tables
test.py: test_keyspace fixture async
test.py: rename fixture test_keyspace to keyspace
test.py topology: test with asyncio
This PR adds necessary modifications to perf_simple_query so that it can be used to test performance of the timeout handling path. With an appropriate combination of flags, it is possible to consistently trigger timeouts on every operation.
The following flags are added:
- `--stop-on-error` - if true (which is the default), the test stops after encountering the first exception and reports it; otherwise it causes errors to be counted and reported at the end.
- `--timeout <x>` - allows to use `USE TIMEOUT <x>` in the benchmark query/statement.
- `--bypass-cache` - uses `BYPASS CACHE` in the benchmark query (relevant only to reads).
Examples:
```
./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write
131023.65 tps ( 56.2 allocs/op, 13.2 tasks/op, 49784 insns/op, 0 errors)
./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write --stop-on-error=false --timeout=0s
97163.73 tps ( 53.1 allocs/op, 5.1 tasks/op, 78687 insns/op, 1000000 errors)
./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000
154060.36 tps ( 63.1 allocs/op, 12.1 tasks/op, 42998 insns/op, 0 errors)
./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --stop-on-error=false --flush --bypass-cache --timeout=0s
30127.43 tps ( 48.2 allocs/op, 14.3 tasks/op, 312416 insns/op, 1000000 errors)
```
Refs: #2363
Closes #10899
* github.com:scylladb/scylla:
test: perf: add bypass cache argument
test: perf: add timeout argument
test: perf: count errors and report the count in results
test: perf: add stop-on-error argument
test: perf: coroutinize run_worker()
test: perf: fix crash on exception in time_parallel_ex
This fixes a quadratic behavior in case lots of snapshots with range
tombstones are queued for merging. Before the change, new snapshots
were inserted at the front, which is also where the worker looks
at. Merging a version has a linear component in complexity function
which depends on the number of range tombstones. If we merge snapshots
starting from the latest to oldest then the whole process becomes
quadratic because the version which is merged accumulates an
increasing amount of tombstones, ones which were already merged
before. We should instead merge starting from the oldest snapshots,
this way each tombstone is applied exactly once during merge.
This bug got worse after 4bd4aa2e88,
which makes merging tombstones more expensive.
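The effect of the merge order can be illustrated with a toy cost model. This is only a sketch: `merge_cost` is a hypothetical helper, and the assumption that one merge costs as much as the number of range tombstones in the version being merged is a simplification, not Scylla's actual MVCC code.

```python
def merge_cost(tombstones_per_version, oldest_first):
    """Count tombstone applications needed to merge all versions into one.

    tombstones_per_version[0] is the oldest version. Assumption: merging a
    version costs as many steps as the tombstones it currently holds.
    """
    sizes = list(tombstones_per_version)
    cost = 0
    if oldest_first:
        # Fold each version, in age order, into the oldest one.
        # The merged (source) version never grows.
        for i in range(1, len(sizes)):
            cost += sizes[i]
            sizes[0] += sizes[i]
    else:
        # Start from the latest version and merge it into its older
        # neighbour; the source keeps accumulating merged tombstones.
        for i in range(len(sizes) - 1, 0, -1):
            cost += sizes[i]
            sizes[i - 1] += sizes[i]
    return cost
```

With 100 versions of 10 tombstones each, this model applies 990 tombstones when merging oldest first and 49500 when merging latest first.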
Closes#10916
When the run scripts for tests of cql-pytest, alternator, redis, etc.,
run Scylla, they should set the UBSAN_OPTIONS and ASAN_OPTIONS so that
if the executable is built with sanitizers enabled, it will ignore false
positives that we know about, and fail on real errors.
The change in this patch affects all test/*/run scripts which use
this shared Scylla-starting code. test.py already had the same settings,
and it affected the tests that it knows to run directly (unit tests,
cql-pytest, etc.).
Fixes#10904
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10915
Add test and server logs, as well as the unidiff, to
XML output. This makes Jenkins reports nicer.
While on it, debug & fix bugs in handling of flaky tests:
- the reset would reset a flaky test even after the last attempt
fails, so it would be impossible to see what happened to it
- the args needed to be reset as well, since execution modifies
them
- we would say that we're going to retry the flaky test when in
fact it was the last attempt to run it and no more retries were
planned
1) Stick to the specific server in control connections.
It could happen that, when starting a cluster and checking
if a specific node is up, the check would actually execute
against an already running node. Prevent this from happening
by setting a white list connection balancing policy for control
connections.
2) When checking if CQL is up, ignore timeout errors
Scylla in debug mode can easily time out on a DDL query,
and the timeout error at start up would lead to the entire cluster
marked as broken. This is too harsh, allow timeouts at start.
3) No longer force schema migration when starting the server
By default, Raft is on, so the nodes are getting schema
through Raft leader. Schema migration significantly slows
down cluster start in debug mode (60 seconds -> 100 seconds),
and even though it was a great test that helped discover
several bugs in Scylla, it shouldn't be part of normal
cluster boot, so disable it.
Repro for bug in concurrent schema changes for many tables and indexing
involved.
Alter tables by running the following in parallel: create a new table,
alter a table (_alter), and index other tables (_index).
Original repro had sets of 20 of those and slept for 20 seconds to
settle. This repro does it for Scylla with just 1 set and 1 second.
This issue goes away once Raft is enabled.
https://github.com/scylladb/scylla/issues/1207
Originally at https://issues.apache.org/jira/browse/CASSANDRA-10250
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Port fails_without_raft to higher level conftest file for future use in
topology pytests.
While there, make it async.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For each table keep a counter and insert rows with sequential values
generated correspondingly by each column's type.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Helpers to create keyspace and manage randomized tables.
Fixture drops all created tables still active after the test finishes.
Includes helper methods to verify schema consistency.
These helpers will be used in Raft schema changes tests coming later.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Run test async using a wrapper for Cassandra python driver's future.
The wrapper was suggested by a user and brought forward by @fruch.
It's based on https://stackoverflow.com/a/49351069 .
Redefine pytest event_loop fixture to avoid issues with fixtures with
scope bigger than function (like keyspace).
See https://github.com/pytest-dev/pytest-asyncio/issues/68
Convert sample test_null to async. More useful test cases will come
afterwards.
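A minimal sketch of such a wrapper, assuming a driver future that reports completion through `add_callbacks(callback, errback)`. `FakeResponseFuture`, `wrap_future` and `run_query` are hypothetical names used only for illustration, not the helpers added by this commit.

```python
import asyncio
import threading


class FakeResponseFuture:
    """Hypothetical stand-in for a callback-based driver future."""

    def __init__(self, result=None, error=None):
        self._result = result
        self._error = error

    def add_callbacks(self, callback, errback):
        # Complete from another thread, as a driver I/O thread would.
        def fire():
            if self._error is not None:
                errback(self._error)
            else:
                callback(self._result)

        threading.Thread(target=fire).start()


def wrap_future(driver_future):
    """Return an asyncio future resolved by the driver's callbacks."""
    loop = asyncio.get_running_loop()
    aio_future = loop.create_future()

    def on_result(result):
        # Callbacks run outside the event loop thread, so hop back into it.
        loop.call_soon_threadsafe(aio_future.set_result, result)

    def on_error(exc):
        loop.call_soon_threadsafe(aio_future.set_exception, exc)

    driver_future.add_callbacks(on_result, on_error)
    return aio_future


async def run_query(driver_future):
    """Await a driver future from async test code."""
    return await wrap_future(driver_future)
```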
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
They are removed because they are not used anywhere
and they contain code that would have to be modified
in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
- Use `sstables::generation_type` in more places
- Enforce conceptual separation of `sstables::generation_type` and `int64_t`
- Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible
Fixes#10796.
Closes#10844
* github.com:scylladb/scylla:
sstables: make generation_type an actual separate type
sstables: use generation_type more soundly
extremum_tracker: do not require default-constructible value types
Fixes#9367
The CL counters pending_allocations and requests_blocked_memory are
exposed in Grafana (etc.) and often referred to as metrics on whether
we are blocking on commit log. But they don't really show this, as
they only measure whether or not we are blocked on the memory bandwidth
semaphore that provides rate back pressure (fixed num bytes/s, sort of).
However, the actual tasks in allocation or segment wait are not exposed, so
if we are blocked on disk IO or waiting for segments to become available,
we have no visible metrics.
While the "old" counters certainly are valid, I have yet to ever see them
be non-zero in modern life.
Closes#9368
Currently in docs/alternator/compatibility.md experimental features
and unimplemented features are bunched together under one heading
("unimplemented features"). In this patch we separate them into two
sections. This makes the "unimplemented features" section shorter,
and also allows us to link to the new "experimental features" section
separately.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10893
A scan over range tombstones will ignore preemption, which may cause
reactor stalls or read failure due to std::bad_alloc.
This is a regression introduced in
5e97fb9fc4. _lower_bound_changed was
always set to false, which is later checked at preemption point and
inhibits yielding.
Closes#10900
Adds the "--timeout" argument which allows specifying a timeout used in
all operations. It works by inserting "USING TIMEOUT <timeout>" in the
appropriate place in the query.
The flag is most useful when set to zero - with an appropriate
combination of other flags (flush, bypass cache) it guarantees that each
operation will time out and performance of the timeout handling logic
can be measured.
Adds the "--stop-on-error" argument to perf_simple_query. When enabled
(and it is enabled by default), the benchmark will propagate exceptions
if any occur in the tested function. Otherwise, errors will be ignored.
Converts the executor::run_worker() method to a coroutine. This will
allow extending the function in further commits without having to
allocate continuations.
The `time_parallel_ex` function creates a sharded<executor> and uses it
to run the benchmark on multiple shards in parallel. However, if the
benchmarking function throws an exception, the sharded<executor> will be
destroyed without being stopped, which triggers an assertion in
sharded<T> destructor.
This commit makes sure that the executor is stopped before being
destroyed by putting `exec.stop()` into a `seastar::defer`.
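The same idea, expressed as a Python analog rather than Seastar code: cleanup registered up front runs on every exit path. `Executor` and `run_benchmark` are hypothetical stand-ins, and `try`/`finally` plays the role that `seastar::defer` plays in the C++ fix.

```python
class Executor:
    """Hypothetical stand-in for a service that must be stopped before it is destroyed."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def run_benchmark(executor, func):
    """Run func and stop the executor on every exit path, including exceptions."""
    try:
        return func()
    finally:
        # Runs both on normal return and when func() raises.
        executor.stop()
```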
* seastar ff46af9ae0...9c016aeebf (8):
> Merge "Handle overflow in token bucket replenisher" from Pavel E
Fixes #10743
Fixes #10846
> abort_source: request_abort: restore legacy no-args method
> configure.py: do not use distutils
> configure.py: drop unused "import sys"
> Revert "Use recv syscall instead of read in do_read_some()"
> Use recv syscall instead of read in do_read_some()
> Merge 'Add initial support for websocket protocol' from Andrzej Stalke
> Merge 'abort_source: request_abort: allow passing exception to subscribers' from Benny Halevy
Closes#10898
While we're iterating over the fetched keyspace names, some of these
keyspaces may get dropped. Handle that by checking if the keyspace still
exists.
Also, when retrieving the replication strategy from the keyspace, store
the pointer (which is an `lw_shared_ptr`) to the strategy to keep it
alive, in case the keyspace that was holding it gets dropped.
Closes#10861
Consider this:
- User starts a repair job with http api
- User aborts all repair
- The repair_info object for the repair job is created
- The repair job is not aborted
In this patch, the repair uuid is recorded before repair_info object is
created, so that repair can now abort repair jobs in the early stage.
Fixes #10384
Closes #10428
Otherwise cql_transport::additional_options_for_proto_ext() complains
about inability to format the enum class value
Introduced by efc3953c (transport: add rate_limit_error)
Fmt version 8.1.1-5.fc35, fresher one must have it out of the box
Fixes#10884
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220627052703.32024-1-xemul@scylladb.com>
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained by looking at the last row in the result set; this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall back to the old way of looking at the last row in the result set. As noted, for now this is still the same as the content of the new field, so there is no problem in mixed clusters.
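The fallback described above can be sketched as follows. This is an illustration under simplifying assumptions: `QueryResult`, `Position` and `next_page_start` are hypothetical names, and a position is reduced to a (partition key, clustering position) pair.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Simplified position: (partition key, clustering position).
Position = Tuple[str, int]


@dataclass
class QueryResult:
    rows: List[Position] = field(default_factory=list)
    # Filled in by replicas that support the new field, None otherwise.
    last_position: Optional[Position] = None


def next_page_start(result: QueryResult) -> Optional[Position]:
    """Pick the position from which the next page should continue."""
    if result.last_position is not None:
        # The replica told us explicitly where it stopped.
        return result.last_position
    if result.rows:
        # Old behaviour: continue from the last live row in the result.
        return result.rows[-1]
    return None
```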
Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933
Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.
Closes#10829
* github.com:scylladb/scylla:
query: have replica provide the last position
idl/query: add last_position to query_result
mutlishard_mutation_query: propagate compaction state to result builder
multishard_mutation_query: defer creating result builder until needed
querier: use full_position instead of ad-hoc struct
querier: rely on compactor for position tracking
mutation_compactor: add current_full_position() convenience accessor
mutation_compactor: s/_last_clustering_pos/_last_pos/
mutation_compactor: add state accessor to compact_mutation
introduce full_position
idl: move position_in_partition into own header
service/paging: use position_in_partition instead of clustering_key for last row
alternator/serialization: extract value object parsing logic
service/pagers/query_pagers.cc: fix indentation
position_in_partition: add to_string(partition_region) and parse_partition_region()
mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
Change 8f39547d89 added
`handle_exception_type([] (const semaphore_aborted& e) {})`,
but it turned out that `named_semaphore_aborted` isn't
derived from `semaphore_aborted`, but rather from
`abort_requested_exception` so handle the base exception
instead.
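A Python analog of the problem, with hypothetical class names mirroring the C++ ones. It assumes that both exception types share `abort_requested_exception` as a base, as the fix implies: a handler for one sibling does not catch the other, while a handler for the common base catches both.

```python
class AbortRequestedException(Exception):
    """Mirrors abort_requested_exception (the common base)."""


class SemaphoreAborted(AbortRequestedException):
    """Mirrors semaphore_aborted (assumed to derive from the base)."""


class NamedSemaphoreAborted(AbortRequestedException):
    """Mirrors named_semaphore_aborted: a sibling, not a subclass, of SemaphoreAborted."""


def handled_by(handler_type, exc):
    """Return True if a handler for handler_type would catch exc."""
    try:
        raise exc
    except handler_type:
        return True
    except Exception:
        return False
```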
Fixes#10666
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10881
As reported in #10867, newer versions of the fmt library
format %Y using 4-characters width, 0-padding the prefix
when needed, while older versions don't do that.
This change moves away from using %Y and friends
fmt specifiers to using explicit numeric-based formatting
conforming to ISO 8601 and making sure the year field
has at least 4 digits and is zero padded. When
negative, the width is upped to 5 so it would show as -0001
rather than -001.
The unit test was updated respectively.
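The padding rule can be illustrated in Python (the actual change is in C++ with fmt; `format_year` and `format_date` are hypothetical helpers written only for this sketch):

```python
def format_year(year: int) -> str:
    """Zero-pad the year to at least 4 digits; the minus sign takes one extra column."""
    width = 5 if year < 0 else 4
    return f"{year:0{width}d}"


def format_date(year: int, month: int, day: int) -> str:
    """Format a calendar date as ISO 8601 YYYY-MM-DD using explicit numeric fields."""
    return f"{format_year(year)}-{month:02d}-{day:02d}"
```

Without the wider field, year -1 would be padded to `-001` instead of `-0001`.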
Fixes#10867
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10870
Currently, the `_reader` member is explicitly
initialized with the result of the call to `make_reader`.
And `make_reader`, as a side effect, assigns a value
to the `_reader_handle` member.
Since C++ initializes class members sequentially,
in the order they are defined, the assignment to `_reader_handle`
in `make_reader()` happens before `_reader_handle` is initialized.
This patch fixes that by changing the definition order,
and consequently, the member initialization order
in the constructor so that `_reader_handle` will be (default-)initialized
before the call to `make_reader()`, avoiding the undefined behavior.
Fixes#10882
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10883
The API uses the http server to serve two directories: the api_ui_dir
where the swagger-ui directory is found and the api_doc_dir where the
swagger definition files are found.
Internally, the API uses the httpd::directory_handler, which appends the
file names it gets from the path to the base directory name.
A user can override the default configuration and set a directory name
that does not end with a slash. This results in files not being
found.
This patch checks whether that slash is missing and, if it is, adds it to
the API configuration.
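A small sketch of the normalization, assuming `/` as the path separator. `with_trailing_slash` and `resolve` are hypothetical helpers; `resolve` only mimics a handler that concatenates the base directory and the file name.

```python
def with_trailing_slash(directory: str) -> str:
    """Return the directory name, making sure it ends with '/'."""
    return directory if directory.endswith("/") else directory + "/"


def resolve(base_dir: str, file_name: str) -> str:
    """Mimic a handler that appends the file name to the base directory."""
    return with_trailing_slash(base_dir) + file_name
```

Without the normalization, a base of `api/doc` and a file `index.json` would be joined into `api/docindex.json`, which does not exist.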
Fixes#10700
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#10877
Evaluating Python code from within gdb is priceless,
especially since all helper classes and functions sourced from
scylla-gdb.py can be used in there. This commit adds a paragraph
in debugging.md mentioning this tool.
Closes#10869
Static columns are not currently allowed in a materialized view. If the
base table has a static column and one tries to create a view with a
"SELECT *", the following error message is printed today:
Unable to include static column 'ColumnDefinition{name=s,
type=org.apache.cassandra.db.marshal.Int32Type, kind=STATIC,
componentIndex=null, droppedAt=-9223372036854775808}' which would
be included by Materialized View SELECT * statement
It is completely unnecessary to include all these details about the
column definition - just its name would have sufficed. In other words,
we should print def.name_as_text(), not the entire def. This is what
other error messages in the same file do as well.
After this patch the error message becomes nicer and clearer:
Unable to include static column 's' which would be included by
Materialized View SELECT * statement
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10854
This series decouples the staging sstables from the table's sstable set.
The current behavior keeps the sstables in the staging directory until view building is done. They are readable as any other sstable, but fenced off from compaction, so they don't go away in the meanwhile.
Currently, when views are built, the sstables are moved into the main table directory where they will then be compacted normally.
The problem with this design is that the staging sstables are never compacted, in particular they won't get cleaned up or scrubbed.
The cleanup scenario opens a backdoor for data resurrection when the staging sstables are moved after view building while possibly containing stale partitions (#9559) which will not be cleaned up until next time cleanup compaction is performed.
With this series, SSTables that are created in or moved to the staging sub-directory are "cloned" into the base table directory by hard-linking the components there and creating a new sstable object which loads the cloned files.
The former, in the staging directory is used solely for view building and is not added to the table's sstable set, while the latter, its clone, behaves like any other sstable and is added either to the regular or maintenance set and is read and compacted normally.
When view building is done, instead of moving the staging sstable into the table's base directory, it is simply unlinked.
If its "clone" wasn't compacted away yet, then it will just remain where it is, exactly like it would be after it was moved there in the present state of things. If it was already compacted and no longer exists, then unlinking will then free its storage.
Note that snapshot is based on the sstables listed by the table, which do not include the staging sstables with this change.
But that shouldn't matter since even today, the sstables in the snapshot have no notion of "staging" directory and it is expected that the MV's are either updated via `nodetool refresh` if restoring sstables from snapshot using the uploads dir, or if restoring the whole table from backup - MV's are effectively expected to be rebuilt from scratch (they are not included in automatic snapshots anyway since we don't have snapshot-coherency across tables).
A fundamental infrastructure change was done to achieve that which is to change the sstable_list which was a std::unordered_set<shared_sstable> into a std::unordered_map<generation_type, shared_sstable> that keeps the shared_sstable objects indexed by generation number (that must be unique). With this model, sstables are supposed to be searched by the generation number, not by their pointer, since when the staging sstable is cloned, there will be 2 shared_sstable objects with the same generation (and different `dir()`) and we must distinguish between them.
Special care was taken to throw a runtime_error exception if, when looking up a shared sstable, another one with the same generation is found, since they must never exist in the same sstable_map.
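The lookup rule can be sketched in Python. `SSTable` and `SSTableMap` are hypothetical stand-ins for `shared_sstable` and the generation-indexed map; the real code is C++ and differs in detail.

```python
class SSTable:
    """Hypothetical stand-in for a shared sstable object."""

    def __init__(self, generation, directory):
        self.generation = generation
        self.directory = directory


class SSTableMap:
    """Keeps sstables indexed by generation, which must be unique within one map."""

    def __init__(self):
        self._by_generation = {}

    def lookup(self, sst):
        """Find sst by generation; refuse a different object with the same generation."""
        found = self._by_generation.get(sst.generation)
        if found is not None and found is not sst:
            raise RuntimeError(
                f"another sstable with generation {sst.generation} "
                f"found in {found.directory}"
            )
        return found

    def insert(self, sst):
        self.lookup(sst)  # raises if a clone with the same generation is present
        self._by_generation[sst.generation] = sst
```

A staging sstable and its clone share a generation, so they have to live in separate maps.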
Fixes #9559
Closes #10657
* github.com:scylladb/scylla:
table: clone staging sstables into table dir
view_update_generator: discover_staging_sstables: reindent
table: add get_staging_sstables
view_update_generator: discover_staging_sstables: get shared table ptr earlier
distributed_loader: populate table directory first
sstables: time_series_sstable_set: insert: make exception safe
sstables: move_to_new_dir: fix debug log message
This series moves the logic to not perform off-strategy compaction if the maintenance set is empty from the table layer down to the compaction_manager layer since it is the one that needs to make the decision.
With that compaction_manager::perform_offstrategy will return a future<bool> which resolves to true
iff off-strategy compaction was required and performed.
The sstable_compaction_test was adjusted and a new compaction_manager_for_testing class was added
to make sure the compaction manager is enabled when constructed (it wasn't, so test_offstrategy_sstable_compaction didn't perform any off-strategy compactions!) and stopped before being destroyed.
Closes#10848
* github.com:scylladb/scylla:
table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager
compaction_manager: offstrategy_compaction_task: refactor log printouts
test: sstable_compaction: compaction_manager_for_testing
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.
This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:
```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
'max_writes_per_second': 100,
'max_reads_per_second': 200
};
```
Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.
Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.
The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.
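To make the idea of per-partition accounting concrete, here is a deliberately simple fixed-window counter. It is not the algorithm used by this PR (see the design doc for that); `PartitionRateLimiter` and `RateLimitError` are hypothetical names, and the injectable `clock` exists only to make the sketch testable.

```python
import time


class RateLimitError(Exception):
    """Raised when a partition exceeds its per-second quota."""


class PartitionRateLimiter:
    """Fixed one-second window counter per partition key (illustrative only)."""

    def __init__(self, max_ops_per_second, clock=time.monotonic):
        self._limit = max_ops_per_second
        self._clock = clock
        self._windows = {}  # partition key -> (window start second, count)

    def account(self, partition_key):
        """Count one operation, or raise RateLimitError if the quota is used up."""
        now = int(self._clock())
        start, count = self._windows.get(partition_key, (now, 0))
        if start != now:
            # A new one-second window begins; forget the old count.
            start, count = now, 0
        if count >= self._limit:
            raise RateLimitError(partition_key)
        self._windows[partition_key] = (start, count + 1)
```

Only the hot partition is rejected; other partitions keep their own counters and are unaffected.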
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:
- Write mode:
```
8f690fdd47 (PR base):
129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op)
This PR:
125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op)
```
- Read mode:
```
8f690fdd47 (PR base):
150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op)
This PR:
151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op)
```
Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded; apart from a small number of operations which failed while each node was being taken down, all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected
Fixes: #4703
Closes #9810
* github.com:scylladb/scylla:
storage_proxy: metrics for per-partition rate limiting of reads
storage_proxy: metrics for per-partition rate limiting of writes
database: add stats for per partition rate limiting
tests: add per_partition_rate_limit_test
config: add add_per_partition_rate_limit_extension function for testing
cf_prop_defs: guard per-partition rate limit with a feature
query-request: add allow_limit flag
storage_proxy: add allow rate limit flag to get_read_executor
storage_proxy: resultize return type of get_read_executor
storage_proxy: add per partition rate limit info to read RPC
storage_proxy: add per partition rate limit info to query_result_local(_digest)
storage_proxy: add allow rate limit flag to mutate/mutate_result
storage_proxy: add allow rate limit flag to mutate_internal
storage_proxy: add allow rate limit flag to mutate_begin
storage_proxy: choose the right per partition rate limit info in write handler
storage_proxy: resultize return types of write handler creation path
storage_proxy: add per partition rate limit to mutation_holders
storage_proxy: add per partition rate limit info to write RPC
storage_proxy: add per partition rate limit info to mutate_locally
database: apply per-partition rate limiting for reads/writes
database: move and rename: classify_query -> classify_request
schema: add per_partition_rate_limit schema extension
db: add rate_limiter
storage_proxy: propagate rate_limit_exception through read RPC
gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
storage_proxy: pass rate_limit_exception through write RPC
replica: add rate_limit_exception and a simple serialization framework
docs: design doc for per-partition rate limiting
transport: add rate_limit_error
It did nothing.
It will be readded in `raft_group0` and it will do something, stay
tuned.
With this we can remove the `feature_service` reference from
`raft_group_registry`.
`raft_group0` was constructed at the beginning of `join_cluster`, which
required passing references to 3 additional services to `join_cluster`
used only for that purpose (group 0 client, raft group registry, and
query processor).
Now we initialize `raft_group0` in main - like all other services - and
pass a reference to `join_cluster` so `storage_service` can store a
pointer to group 0.
We initialize `raft_group0` before we start listening for RPCs in
`messaging_service`. In a later commit we'll move the initialization
of group 0 related verbs to the constructor of `raft_group0` from
`storage_service`, so they will be initialized before we start
listening for RPCs.
In schema_altering_statement: we will bounce statements to shard 0
whether Raft is enabled or not.
In migration_manager, when we're sending a group 0 snapshot: well, if
we're sending a group 0 snapshot, Raft must be enabled; the check is
redundant.
The test checked if creating a table with CDC enabled on shard other
than 0 would create the CDC log table as well; it was a regression test
for #5582. However we will soon bounce all schema change requests to
shard 0, so the test's purpose is gone.
I need to remove this test because `cquery_nofail` does not handle the
bouncing correctly: it silently accepts the bounce message, assumes that
the query was successful and returns. So after we change the code to
start bouncing all requests to shard 0, if a query was run inside test
code using `cquery_nofail` on a shard different than 0 it would do
nothing and following queries executed on shard 0 would fail because they
depended on the effect of the aforementioned query.
It's not possible to abort an RPC call entirely, since the remote part
continues running (if the message got out). Calling the provided abort
source does the following:
1. if the message is still in the outgoing queue, drop it,
2. resolve waiter callbacks exceptionally.
Using the word "cancellable" is more appropriate.
Also write a small comment at `send_message_cancellable`.
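The two steps can be sketched with asyncio. This is an illustration, not the messaging_service code: `Outbox`, `CancelledSend`, `send_cancellable` and `flush` are hypothetical names.

```python
import asyncio
from collections import deque


class CancelledSend(Exception):
    """Delivered to waiters of a cancelled send."""


class Outbox:
    """Toy outgoing queue; flush() stands in for handing messages to the network."""

    def __init__(self):
        self.queue = deque()
        self.sent = []

    def send_cancellable(self, message):
        """Queue a message; return (waiter future, cancel function)."""
        waiter = asyncio.get_running_loop().create_future()
        entry = (message, waiter)
        self.queue.append(entry)

        def cancel():
            # 1. If the message is still in the outgoing queue, drop it.
            try:
                self.queue.remove(entry)
            except ValueError:
                pass  # already sent: the remote part keeps running
            # 2. Resolve the waiter exceptionally.
            if not waiter.done():
                waiter.set_exception(CancelledSend(message))

        return waiter, cancel

    def flush(self):
        """Hand every queued message to the (pretend) network."""
        while self.queue:
            message, _ = self.queue.popleft()
            self.sent.append(message)
```

Cancelling after `flush()` still fails the waiter, but the message has already gone out, which is why "cancellable" describes the behaviour better than "abortable".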
Make it consistent with the other 'send message' functions.
Simplify code generation logic in idl-compiler.
Interestingly this function is not used anywhere so I didn't have to fix
any call sites.
Clone staging sstables so their content may be compacted while
views are built. When done, the hard-linked copy in the staging
subdirectory will be simply unlinked.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't have to go over all sstables in the table to select the
staging sstables out of them, we can get it directly from the
_sstables_staging map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's potentially a bit more efficient since
t.get_sstables is called only once, while
t.shared_from_this() is called per staging sstable.
Also, prepare for the following patches that modify
this function further.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.
Fixes#10787
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make it account for all the changes done in the compaction manager
recently. 5.0 is not affected. So does not merit a backport.
(gdb) scylla compaction-tasks
1 type=sstables::compaction_type::Reshard, state=compaction_manager::task::state::active, "keyspace1"."standard1"
Total: 1 instances of compaction_manager::task
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220621225600.20359-1-raphaelsc@scylladb.com>
The command is quite straightforward, but it didn't offer
any documentation when calling `help scylla shard`, so it's
hereby added. As a small bonus, a more comprehensive message
is printed when the argument is not an integer.
Message-Id: <9b958a4befce1c7baa6f86504ab74b93840b37e9.1655984258.git.sarna@scylladb.com>
`scylla thread` command is extended with a non-intrusive
option for dumping saved registers from the jmp_buf structure
in an unmangled form.
It can later be useful, e.g. for peeking into thread's instruction
pointer or reasoning about its stack.
Example debugging session:
(gdb) scylla threads
[shard 1] (seastar::thread_context*) 0x6010000d9e00, stack: 0x601004f00000
[shard 1] (seastar::thread_context*) 0x6010000daf00, stack: 0x601004e00000
(gdb) scylla thread --print-regs 0x6010000d9e00
rbx: 0x601004f1fd00
rbp: 0x601004f1fc20
r12: 0x6010000d9e20
r13: 0x6010002a3190
r14: 0x601004f1fd08
r15: 0x6010000d9e10
rsp: 0x601004f1fbb0
rip: 0x2f0aea6
(gdb) disassemble 0x2f0aea6
Dump of assembler code for function _ZN7seastar12jmp_buf_link10switch_outEv:
0x0000000002f0ae90 <+0>: push %rax
0x0000000002f0ae91 <+1>: mov 0xc8(%rdi),%rax
0x0000000002f0ae98 <+8>: mov %rax,%fs:0xfffffffffffe5dc8
0x0000000002f0aea1 <+17>: call 0x30333d0 <_setjmp@plt>
0x0000000002f0aea6 <+22>: test %eax,%eax
0x0000000002f0aea8 <+24>: je 0x2f0aeac <_ZN7seastar12jmp_buf_link10switch_outEv+28>
0x0000000002f0aeaa <+26>: pop %rax
0x0000000002f0aeab <+27>: ret
0x0000000002f0aeac <+28>: mov %fs:0xfffffffffffe5dc8,%rdi
0x0000000002f0aeb5 <+37>: mov $0x1,%esi
0x0000000002f0aeba <+42>: call 0x30333c0 <longjmp@plt>
End of assembler dump.
Message-Id: <553c1ed76987776916d5261ed13866650e84df34.1655984258.git.sarna@scylladb.com>
In order to cover more code paths, the test case
now places filtering on various combinations of base columns,
including both primary keys and regular columns.
It also makes the test scylla_only, as filtering is an extension
not supported in Cassandra right now.
Closes#10860
Use the recently introduced query-result facility to have the replica
set the position where the query should continue from. For now this is
the same as what the implicit position would have been previously (last
row in result), but it opens up the possibility to stop the query at a
dead row.
To be used to allow the replica to specify the last position in the
stream, where the query was left off. Currently this is always
the same as the implicit position -- the last row in the result-set --
but this requires only stopping the read on a live row, which is a
requirement we want to lift: we want to be able to stop on a tombstone.
As tombstones are not included in the query result, we have to allow the
replica to overwrite the last seen position explicitly.
This patch introduces the new field in the query-result IDL but it is
not written to yet, nor is it read, that is left for the next patches.
Currently the result builder is created two frames above the method in
which it is actually needed. Push down a factory method instead and create it
where it is actually used. This allows us to pass it arguments that are
present only in the method which uses it.
A simple struct containing a full position, including a partition key
and a position in partition. Two variants are introduced: an owning
version and a view. This is to replace all the ad-hoc structures
introduced for the same purpose: std::pair() and std::tuple() of
partition key and clustering key, and other similar small structs
scattered around the code.
This patch does not replace any of the above mentioned constructs with
the new full_position, it merely introduces it to enable incremental
standardization.
The former allows for expressing more positions, like a position
before/after a clustering key. This practically enables the coordinator
side paging logic, for a query to be stopped at a tombstone (which can
have said positions).
chunked_vector was headed by a short comment which didn't really explain
why it exists and how and why it really differs from std::deque.
Moreover, it made the vague claim that it "limits" contiguous
allocations, which it really doesn't (at least not in the asymptotic
sense).
In this patch I wrote a much longer comment, which I hope will clearly
explain exactly what chunked_vector is, how it really differs in its
contiguous allocations from std::deque, and what it guarantees and
doesn't guarantee.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10857
To make scylla setup scripts easier to handle in Ansible, stop deleting
perftune.yaml and detect cpuset.conf changes by the mtime of the file.
Also, skip updating cpuset.conf when the same parameter is specified.
Fixes #10121
Closes #10312
"
The way dc/rack info is maintained is very intricate.
The dc/rack strings originate at snitch, get propagated via gossiper,
get notified to storage service which, in turn, stores them into the
system keyspace and token metadata. Code that needs to get dc/rack
for a given endpoint calls snitch which tries to get the data from
gossiper and if failed goes and loads it from system keyspace cache.
Also there's "internal IP" thing hanging around that loops messaging
service in both -- updating and getting the info.
The plan is to make topology (that currently sits on token metadata)
stay the only "source of truth" regarding the endpoints' dc/rack and
internal IP info. The dc/rack mappings are put into topology already,
but it cannot yet fully replace snitch for two reasons:
- it doesn't map internal IP to endpoint
- it doesn't get data stored in system keyspace
So what this patch set does is patches most of the dc/rack getters
to call topology methods. The topology is temporarily patched to
just call the respective snitch methods. This removes a big portion
of calls for global snitch instance.
After the set the places that still explicitly rely on snitch to
provide dc/rack are
- messaging service: needs internal IP knowledge on topology
- db/consistency_level: is all "global", needs heavier patching
- tests: just later
"
* 'br-get-dc-rack-from-topology-2' of https://github.com/xemul/scylla:
proxy stats: Get rack/datacenter from topology
proxy stats: Push topology arg to get_ep_stats
api: Get rack/datacenter from topology
hints: Remove snitch dependency
hints: Get rack/datacenter from topology
alternator: Get rack/datacenter from topology
range_streamer: Get rack/datacenter from topology
repair: Get rack/datacenter from topology
view: Get rack/datacenter from topology
storage_service: Get rack/datacenter from topology
proxy: Get rack/datacenter from topology
topology: Add get_rack/_datacenter methods
compaction_manager needs to decide about running off-strategy
compaction or not based on the maintenance_set, not partly
in table::trigger_offstrategy_compaction and part in
the compaction_manager layer as it is done today.
So move the logic down to perform_offstrategy
that now returns future<bool> to return true
iff it performed offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move logging from run_offstrategy_compaction to do_run
so that in the next patch we can skip run_offstrategy_compaction
if the maintenance set is empty (but still log it,
for the sake of dtests).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make the compaction manager for testing using
this class.
Makes sure to enable the compaction manager
and to stop it before it's destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a reset method which is going to be used in the next patches
for updating the loading_cache config and also to be able to flush the cache
without having to update the scylla config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
Adds a metric "read_rate_limited" which indicates how many times a read
operation was rejected due to per-partition rate limiting. The metric
differentiates between reads rejected by the coordinator and reads
rejected by replicas.
Adds a metric "write_rate_limited" which indicates how many times a
write operation was rejected due to per-partition rate limiting. The
metric differentiates between writes rejected by the coordinator and
writes rejected by replicas.
Adds statistics which count how many times a replica has decided to
reject a write ("total_writes_rate_limited") or a read
("total_reads_rate_limited").
Adds the per_partition_rate_limit_test.cc file. Currently, it only
contains a test which verifies that the feature correctly switches off
rate limiting for internal queries (!allow_limit || internal sg).
The per-partition rate limit feature requires all nodes in the cluster
to support it in order to work well. This commit adds a check which
disallows creating/altering tables with per-partition rate limit until
the node is sure that all nodes in the cluster support it.
Adds a flag to get_read_executor which decides whether the read should
be rate limited or not. The read executors were modified to choose the
appropriate per partition rate limit info parameter and send it to the
replicas.
Now, get_read_executor is able to return coordinator exceptions without
throwing them. In an upcoming commit, it will start returning rate limit
exception in some cases and it is preferable to return them without
throwing.
The query_result_local and query_result_local_digest methods were
updated to accept db::per_partition_rate_limit::info structure and pass
it on to database::accept.
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.
The new parameter is mandatory and all call sites were updated.
The mutate_prepare and create_write_response_handler(_helper) functions
are modified to be able to return exceptions without throwing them. In
an upcoming commit, create_write_response_handler will sometimes return
rate limit exception, and it is preferable to return them without
throwing.
Adds the `db::rate_limiter` to the `database` class and modifies the
`query` and `apply` methods so that they account the read/write
operations in the rate limiter and optionally reject them.
Moves the classify_query higher and renames it to classify_request. The
function will be reused in further commits to protect non-user queries
from accidentally being rate limited.
Adds the new `per_partition_rate_limit` schema extension. It has two
parameters: `max_writes_per_second` and `max_reads_per_second`.
In the future commits they will control how many operations of given
type are allowed for each partition in the given table.
Introduces the rate_limiter, a replica-side data structure meant for
tracking the frequency with which each partition is being accessed
(separately for reads and writes) and deciding whether the request
should be accepted and processed further or rejected.
The limiter is implemented as a statically allocated hashmap which keeps
track of the frequency with which partitions are accessed. Its entries
are incremented when an operation is admitted and are decayed
exponentially over time.
If a partition is detected to be accessed more than its limit allows,
requests are rejected with a probability calculated in such a way that,
on average, the number of accepted requests is kept at the limit.
The structure currently weighs a bit above 1MB and each shard is meant
to keep a separate instance. All operations are O(1), including the
periodic timer.
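
The scheme above can be sketched in Python as follows. This is an
illustrative model only, not the db::rate_limiter code: the class name,
the table size, the decay factor, the slot selection, and the choice to
count every observed request (rather than only admitted ones) are all
assumptions made for the sketch. Decay is applied lazily per entry so
that the timer tick stays O(1).

```python
import random


class RateLimiterSketch:
    """Toy model: fixed-size table of exponentially decaying counters."""

    def __init__(self, slots=1024, decay=0.5, seed=None):
        self._values = [0.0] * slots   # fixed-size table, never grows
        self._ticks = [0] * slots      # tick at which each entry was last decayed
        self._now = 0                  # current timer tick
        self._decay = decay            # hypothetical per-tick decay factor
        self._rng = random.Random(seed)

    def on_timer(self):
        # O(1): only advance the clock; entries are decayed lazily on access.
        self._now += 1

    def _touch(self, token):
        # Colliding partitions share a counter in this sketch.
        i = hash(token) % len(self._values)
        age = self._now - self._ticks[i]
        if age:
            self._values[i] *= self._decay ** age
            self._ticks[i] = self._now
        return i

    def admit(self, token, limit):
        """Return True if the request should be processed, False if rejected."""
        i = self._touch(token)
        self._values[i] += 1.0
        count = self._values[i]
        if count <= limit:
            return True
        # Accept with probability limit/count, so that on average roughly
        # `limit` requests are accepted per decay period.
        return self._rng.random() < limit / count
```

With this acceptance rule, a partition that stays under its limit is
never rejected, while a hot partition sees its acceptance probability
shrink as its counter grows.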
This commit modifies the read RPC and the storage_proxy logic so that
the coordinator knows whether a read operation failed due to rate limit
being exceeded, and returns `exceptions::rate_limit_exception` if that
happens.
We would like to extend the read RPC to return an optional, second value
which indicates an exception - seastar type-erases exception on the RPC
handler boundary and we need to differentiate rate_limit_exception from
others. However, it may happen that a replica with an up-to-date version
of Scylla tries to return an exception in this way to a coordinator with
an old version and the coordinator will drop the error, thinking that
the request succeeded.
In order to protect from that, we introduce the
`TYPED_ERROR_IN_READ_RPC` feature. Only after it is enabled replicas
will start returning exceptions in the new way, and until then all
exceptions will be reported using seastar's type-erasure mechanism.
This commit modifies the storage_proxy logic so that the coordinator
knows whether a write operation failed due to rate limit being exceeded,
and returns `exceptions::rate_limit_exception` when that happens.
Introduces `replica::rate_limit_exception` - an exception that is
supposed to be thrown/returned on the replica side when the request is
rejected due to exceeding the per-partition rate limit.
Additionally, introduces the `exception_variant` type which allows to
transport the new exception over RPC while preserving the type
information. This will be useful in later commits, as the coordinator
will have to know whether a replica has failed due to rate limit being
exceeded or another kind of error.
The `exception_variant` currently can only either hold "other exception"
(std::monostate) or the aforementioned `rate_limit_exception`, but can
be extended in a backwards-compatible way in the future to be able to
hold more exceptions that need to be handled in a different way.
Adds a CQL protocol extension which introduces the rate_limit_error. The
new error code will be used to indicate that the operation failed due to
it exceeding the allowed per-partition rate limit.
The error code is supposed to be returned only if the corresponding CQL
extension is enabled by the client - if it's not enabled, then
Config_error will be returned in its stead.
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).
Fixes #10851
Closes #10855
The time point is multiplied by an adjustment factor of 1000
for boost::posix_time::time_duration::ticks_per_second() = 1000000
when calling boost::posix_time::milliseconds(count)
and that may lead to integer overflow as reported
by the UndefinedBehaviorSanitizer.
See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187
This change checks for possible overflow in advance and
prints the raw counter value in this case, along with
an explanation.
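
A minimal sketch of such an up-front check, written in Python for
brevity. The function name and the exact wording of the fallback message
are assumptions, and Python's datetime range (years 1 to 9999) is
narrower than what the C++ code can represent, so the second fallback
below triggers earlier than it would in Scylla.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
TICKS_PER_MILLISECOND = 1000  # 1000000 ticks per second / 1000 ms per second
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_point_to_string_sketch(count_ms):
    """Format milliseconds since the epoch, guarding against overflow."""
    out_of_range = f"{count_ms} milliseconds (out of range)"
    # Python integers never overflow, so the product is exact here. In C++
    # the check has to be done before multiplying, e.g. by comparing
    # count_ms against INT64_MAX / 1000.
    ticks = count_ms * TICKS_PER_MILLISECOND
    if not INT64_MIN <= ticks <= INT64_MAX:
        return out_of_range
    try:
        moment = EPOCH + timedelta(milliseconds=count_ms)
    except OverflowError:
        # The value fits in 64 bits but not in a calendar date.
        return out_of_range
    return moment.isoformat(timespec="milliseconds")
```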
Refs #10830
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10831
* github.com:scylladb/scylla:
test: types: add test cases for timestamp_type to_string format
types: time_point_to_string: harden against out of range timestamps
The reference is already at hand. The get_ep_stats() calls another
helper that also maps endpoint to datacenter, but it can get the
obtained dc sstring via argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter will need it to get dc info from. All the callers are either
storage proxy or have storage proxy pointer/reference to get topology
from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the previous patch the hints manager class is left with an unused
dependency on snitch. Removing it reveals that several unrelated places
got needed headers indirectly via the host_filter.hh -> snitch_base.hh
inclusion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's needed in source filter classes so range-streamer passes the
topology reference into its methods.
Nice side effect -- snitch header goes away from range-streamer one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Repair gets token metadata from its local database reference. Not
perfect, repair should better have its own private token meta reference,
but it's OK for now.
The change obsoletes static get_local_dc helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view code already gets token metadata from global proxy instance. Do
the same to get topology object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Proxy has shared token metadata from which it can get the topology.
This change obsoletes static get_local_dc() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For now they just forward the request to snitch. Once topology is
properly updated with boot-time dc/rack info and knows the internal IP,
it will be able to serve requests on its own.
For convenience overloads without arguments return dc/rack for
current node.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
* seastar 443e6a9b77...ff46af9ae0 (15):
> rpc: Take care of client::send() future in send_helper
> test: futures: add test_get_on_exceptional_promise
> compile_commands.json generation in configure
> condition-variable: use an empty loop for spinning CPU
> byteorder: use boost::endian to do the conversion.
> Merge "Replace RPC outgoing queue with continuation chain" from Pavel E
> test_runner: use std::endl to ensure messages are flushed
> memory: realloc: defer to malloc if ptr is null
> cmake: require boost 1.73 for building with C++20
> reactor: backend: io_uring: disable on old kernels if RAID devices exist
> Move function in invoke_on_all
> core/loop: drop unused parameters
> net/api: add connected_socket::operator bool()
> fix cpuset count is zero after shift
> docker: add pandoc package
Closes #10845
Now that `generation_type` is used properly (at least in some places),
we turn to the compiler to help keep the generation/value separation
intact.
Fixes#10796.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Following the previous patch that changed time_point_to_string
we should cement the different edge cases for the next
time this function changes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The grammar now checks that UPDATEs don't clash (for example,
updates to the same column). The checks are good, but the grammar
isn't the right place for them - better to concentrate all the checks
in the prepare() code so it's easy to see all the checks.
Move the checks to raw::update_statement::prepare_internal(). This
exposes that the checks are quadratic, so add a comment. It could be
fixed with a stable_sort() first, but that is left to later.
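
To illustrate the complexity remark, here is a simplified Python sketch.
It reduces "clash" to "the same column is assigned twice", which is an
assumption; the real checks in prepare_internal() are richer. Both
function names are hypothetical.

```python
def find_clash_quadratic(columns):
    """Compare every pair of assignments: O(n^2) comparisons."""
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if first == second:
                return first
    return None


def find_clash_sorted(columns):
    """Sort first, then compare neighbours only: O(n log n)."""
    ordered = sorted(columns)  # Python's sort is stable, like stable_sort()
    for first, second in zip(ordered, ordered[1:]):
        if first == second:
            return first
    return None
```

After sorting, assignments to the same column are adjacent, so a single
linear pass is enough to find a clash.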
Closes #10820
If the len2 argument to crc32_combine() is zero, then the crc2
argument must also be zero.
fast_crc32_combine() explicitly checks for len2==0, in which case it
ignores crc2 (which is the same as if it were zero).
zlib's crc32_combine() used to have that check prior to version
1.2.12, but then lost it, making it necessary for callers to be more
careful.
Also add the len2==0 check to the dummy fast_crc32_combine()
implementation, because it delegates to zlib's.
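
The reason the len2==0 case needs care can be seen in a small Python
model. Python's zlib module does not expose crc32_combine(), so the
sketch below uses a slow O(len2) identity built on zlib.crc32() instead
of the real algorithm; the function name is an assumption.

```python
import zlib


def crc32_combine_sketch(crc1, crc2, len2):
    """Combine crc32(A) and crc32(B) into crc32(A + B), where len(B) == len2."""
    if len2 == 0:
        # B is empty, so the result is crc32(A). crc2 is ignored, which is
        # the same as treating it as zero.
        return crc1
    zeros = bytes(len2)
    # crc32(A + B) == crc32(A + zeros) ^ crc32(zeros) ^ crc32(B)
    return zlib.crc32(zeros, crc1) ^ zlib.crc32(zeros) ^ crc2
```

Without the guard, the general formula with len2 == 0 reduces to
crc1 ^ crc2, which is only correct when crc2 is zero.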
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10731
Due to implementation details, all `deletable_row`s used in `row()` are copied twice, even though they only need to be copied/applied once.
This is unnecessary work.
`perf_simple_query_g --enable-cache=1 --flush --smp 1 --duration 30`
Before:
median 158516.17 tps ( 64.1 allocs/op, 12.1 tasks/op, 45010 insns/op)
After:
median 164307.76 tps ( 62.1 allocs/op, 12.1 tasks/op, 43220 insns/op)
Closes #10509
* github.com:scylladb/scylla:
partition_snapshot_row_cursor: construct the clustering_row directly in row()
mutation_fragment: add a "from deletable_row" constructor to clustering_row
mutation_fragment: pass the applied row by reference in clustering_row::apply()
"
In order to wire-in the compaction_throughput_mb_per_sec the compaction
creation and stopping will need to be patched. Right now both places are
quite hairy, this set coroutinizes stop() for simpler adding of stopping
bits, unifies all the compaction manager constructors and adds the
compaction_manager::config for simpler future extending.
As a side effect the backlog_controller class gets an "abstract" sched
group it controls which in turn will facilitate seastar sched groups
unification some day.
"
* 'br-compaction-manager-start-stop-cleanup' of https://github.com/xemul/scylla:
compaction_manager: Introduce compaction_manager::config
backlog_controller: Generalize scheduling groups
database: Keep compound flushing sched group
compaction_manager: Swap groups and controller
compaction_manager: Keep compaction_sg on board
compaction_manager: Unify scheduling_group structures
compaction_manager: Merge static/dynamic constructors
compaction_manager: Coroutinuze really_do_stop()
compaction_manager: Shuffle really_do_stop()
compaction_manager: Remove try-catch around logger
The sstable set param isn't being used anywhere, and it's also buggy
as the sstable run list isn't being updated accordingly. So it could
happen that the set contains sstables but the run list is empty,
introducing inconsistency.
We're fortunate that the bug wasn't activated as it would've been
a hard one to catch. Found this while auditing the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
The time point is multiplied by an adjustment factor of 1000
for boost::posix_time::time_duration::ticks_per_second() = 1000000
when calling boost::posix_time::milliseconds(count).
That may lead to integer overflow as reported by the
UndefinedBehaviorSanitizer.
See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187
This change uses gmtime_r to convert seconds since unix epoch
to std::tm and the fmt library to format the iso representation
of the time_point to avoid exceptions and undefined behavior.
gmtime_r may still detect an overflow "when the year does not fit into
an integer" (see ctime(3)). In this case we return a backward
compatible representation of "{count} milliseconds (out of range)".
Refs #10830
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`generation_type` is (supposed to be) conceptually different from
`int64_t` (even if physically they are the same), but at present
Scylla code still largely treats them interchangeably.
In addition to using `generation_type` in more places, we
provide (no-op) `generation_value()` and `generation_from_value()`
operations to make the smoke-and-mirrors more believable.
The churn is considerable, but all mechanical. To avoid even
more (way, way more) churn, unit test code is left untreated for
now, except where it uses the affected core APIs directly.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Adds measuring the apparent delta vector of footprint added/removed within
the timer time slice, and potentially include this (if influx is greater
than data removed) in threshold calculation. The idea is to anticipate
crossing usage threshold within a time slice, so request a flush slightly
earlier, hoping this will give all involved more time to do their disk
work.
Obviously, this is very akin to just adjusting the threshold downwards,
but the slight difference is that we take actual transaction rate vs.
segment free rate into account, not just static footprint.
Note: this is a very simplistic version of this anticipation scheme,
we just use the "raw" delta for the timer slice.
A more sophisticated approach would perhaps do either a lowpass
filtered rate (adjust over longer time), or a regression or whatnot.
But again, the default period of 10s is something of an eternity,
so maybe that is superfluous...
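
The anticipation rule can be summarized in a few lines of Python. The
function and parameter names are hypothetical; only the rule itself
(add the per-slice delta to the footprint when influx exceeds release)
comes from the description above.

```python
def should_request_flush(used_bytes, threshold_bytes,
                         added_in_slice, released_in_slice):
    """Decide whether to request a flush during this timer slice."""
    # "Raw" delta of footprint within the last timer slice.
    delta = added_in_slice - released_in_slice
    # Only anticipate growth: a negative delta does not lower the estimate.
    anticipated = used_bytes + max(delta, 0)
    return anticipated > threshold_bytes
```

For example, with 90 bytes used against a threshold of 100, a slice that
added 30 bytes and released 10 already triggers a flush request, while a
slice that released more than it added does not.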
Adds "bytes_released" and "bytes_flush_requested", representing
total bytes released from disk as a result of segment release
(as allocation bytes + overhead - not counting unused "waste"),
resp. total size we've requested flush callbacks to release data,
also counted as actual used bytes in segments we request be made
released.
These counters, together with bytes_written, should in ideal use
cases be at an equilibrium (actually equal), thus observing them
should give an idea on whether we are imbalanced in managing to
release bytes in same rate as they are allocated (i.e. transaction
rate).
Apparent mismerge or something. We already have an unused "_flush_position",
intended to keep track of the last requested high rp.
Now actually update and use it. The latter to avoid sending requests for
segments/cf id:s we've already requested external flush of. Also enables
us to ensure we don't do double bookkeeping here.
The current maintainer.md lacks any guidelines on what patches to accept/reject. Instead maintainers are expected to observe the unwritten rules as exercised by more senior maintainers, as well as use their own judgement or ask when in doubt. This has worked well as maintainers are all people who either worked at the company for a long time and hence had time to observe how things work, and/or have previous experience maintaining open-source projects. Nevertheless, many times I have wished we had a guideline I could glance at to make sure I considered all the angles and to make sure I did not forget some important unwritten rule.
This series attempts to concisely summarize these unwritten rules in the form of a checklist, without attempting to cover all exceptions and corner-cases. This should already be enough for a maintainer-in-doubt to be able to quickly go over the checklist and see if they forgot to check anything (especially when evaluating backports).
/cc @scylladb/scylla-maint
Closes #10806
* github.com:scylladb/scylla:
docs/contribute/maintainer.md: add merging and backporting guidelines
docs/contribute/CONTRIBUTING.md: add reference to review checklist:
docs/contribute/review-checklist.md: add section about patch organization
docs/contribute/maintainer.md: expand section on git submodule sync
Currently row() creates an empty clustering_row, then applies deletable_rows
from the cursor to the empty clustering_row.
But the apply logic is unnecessary for the first apply(), and it's cheaper
to simply copy the row.
Currently, construction of clustering_row from deletable_row is done by
applying the deletable_row to an empty clustering_row.
Direct construction is a slightly cheaper alternative.
Currently, clustering_row::apply() takes deletable_row by reference, but
copies it before passing it to deletable_row::apply(). This is more expensive
than passing the reference down (by about 1800 instructions for
perf_simple_query rows).
If there are zero leaving nodes, there is no need to calculate anything.
Skipping the unnecessary calculation significantly reduces the time
spent on calculating pending ranges in large clusters.
Refs #10337
Closes #10822
from Nadav Har'El
This small series improves Alternator's BatchGetItem performance by
grouping requests to the same partition together (Fixes #10753) and also
improves error checking when the same item is requested more than once
(Fixes #10757).
Closes #10834
* github.com:scylladb/scylla:
alternator: make BatchGetItem group reads by partition
test/alternator: additional test for BatchGetItem
Today, if you want to reproduce a rare condition using the same RNG seed
reported, you cannot use test.py which provides useful infrastructure
and will have to run the tests manually instead.
So let's extend test.py to allow optional forwarding of RNG seed to
boost tests only, as other suites don't support the seed option.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220615223657.142110-1-raphaelsc@scylladb.com>
DynamoDB API's BatchGetItem invokes a number (up to 25) of read requests
in parallel, returning when all results are available. Alternator naively
implemented this by sending all read requests in parallel, no matter which
requests these were.
That implementation was inefficient when all the requests are to different
items (clustering rows) of the same partition. In a multi-node setup this
will end up sending 25 separate requests to the same remote node(s). Even
on a single-node setup, this may result in reading from disk more than
once, and even if the partition is cached - doing an O(logN) search in
each multiple times.
What we do in this patch, instead, is to group all the BatchGetItem
requests that aimed at the same partition into a single read request
asking for a (sorted) list of clustering keys. This is similar to an
"IN" request in CQL.
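
The grouping step can be sketched as follows. This is a Python model
with hypothetical names, not the Alternator C++ code: keys are modelled
as (partition key, clustering key) tuples, and the duplicate check
reflects the behaviour described for issue #10757.

```python
from collections import defaultdict


def group_by_partition(requested_keys):
    """Turn a list of (partition_key, clustering_key) pairs into one
    sorted clustering-key list per partition."""
    groups = defaultdict(set)
    for partition_key, clustering_key in requested_keys:
        if clustering_key in groups[partition_key]:
            # The same item was requested more than once.
            raise ValueError(
                f"duplicate key in batch: {(partition_key, clustering_key)}")
        groups[partition_key].add(clustering_key)
    # One read per partition, asking for a sorted list of clustering keys.
    return {pk: sorted(cks) for pk, cks in groups.items()}
```

A batch of 25 keys that all fall into one partition then becomes a
single read instead of 25 separate ones.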
As an example of the performance benefit of this patch, I tried a
BatchGetItem request asking for 20 random items from a 10-million item
partition. I measured the latency of this request on a single-node
Scylla. Before this patch, I saw a latency of 17-21 ms (the lower number
is when the request is retried and the requested items are already in
the cache). After this patch, the latency is 10-14 ms. The performance
improvement on multi-node clusters is expected to be even higher.
Unfortunately the patch is less trivial than I hoped it would be,
because some of the old code was organized under the assumption that
each read request only returned one item (and if it failed, it means
only one item failed), so this part of the code had to be reorganized
(and, for making the code more readable, coroutinized).
An unintended benefit of the code reorganization is that it also gave
me an opportunity to fail an attempt to ask BatchGetItem the same
item more than once (issue #10757).
The patch also adds a few more corner cases in the tests, to be even
more sure that the code reorganization doesn't introduce a regression
in BatchGetItem.
Fixes #10753
Fixes #10757
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Previously, a race condition was possible in which the failure_detector_loop is started before gossiper._enabled is set to true on every shard.
This change ensures that _enabled is set to true before moving forward.
Closes #10548
Commit e739f2b779 ("cql3: expr: make evaluate() return a
cql3::raw_value rather than an expr::constant") introduced
raw_value::view() as a synonym to raw_value::to_view() to reduce
churn. To fix this duplication, we now remove raw_value::to_view().
raw_value::to_view() was picked for removal because it has fewer
call sites, reducing churn again.
Closes #10819
A named bind-variable can be reused:
SELECT * FROM tab
WHERE a = :var AND b = :var
Currently, the grammar just ignores the possibility and creates
a new variable with the same name. The new variable cannot be
referenced by name since the first one shadows it.
Catch variable reuse by maintaining a map from bind variable names
to indexes, and check that when reusing a bind variable the types
match.
A unit test is added.
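
The bookkeeping can be modelled with a short Python sketch. The class
and method names are hypothetical, and types are represented as plain
strings; the actual grammar and prepare code are in C++.

```python
class BindVariablesSketch:
    """Maps named bind variables to a single index each."""

    def __init__(self):
        self._index_by_name = {}   # variable name -> bind index
        self._types = []           # bind index -> expected type

    def index_for(self, name, cql_type):
        index = self._index_by_name.get(name)
        if index is None:
            # First use of this name: allocate a new bind index.
            index = len(self._types)
            self._index_by_name[name] = index
            self._types.append(cql_type)
        elif self._types[index] != cql_type:
            # Reuse is allowed only if the types match.
            raise TypeError(
                f"bind variable :{name} reused with type {cql_type}, "
                f"previously {self._types[index]}")
        return index
```

For the query above, both occurrences of :var resolve to index 0 as long
as columns a and b have the same type.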
Fixes #10810
Closes #10813
When evaluating an LWT condition involving both static and non-static
cells, and matching no regular row, the static row must be used UNLESS
the IF condition is IF EXISTS/IF NOT EXISTS, in which case special rules
apply.
Before this fix, Scylla used to assume a row doesn't exist if there is
no matching primary key. In Cassandra, if there is a
non-empty static row in the partition, a regular row based
on the static row's cell values is created in this case, and then this
row is used to evaluate the condition.
This problem was reported as gh-10081.
The reason for Scylla behaviour before the patch was that when
implementing LWT I tried to converge Cassandra data model (or lack
thereof) with a relational data model, and assumed a static row is a
"shared" portion of a regular row, i.e. a storage level concept intended
to save space, and doesn't have independent existence.
This was an oversimplification.
This patch fixes gh-10081, making Scylla semantics match the one of
Cassandra.
I will now list other known examples when a static row has an own
independent existence as part of a table, for cataloguing purposes.
SELECT * from a partition which has a partition key
and a static cell set returns 1 row. If later a regular row is added
to the partition, the SELECT would still return 1 row, i.e.
the static row will disappear, and a regular row will appear instead.
Another example showing a static row has an independent existence below:
CREATE TABLE t (p int, c int, s int static, PRIMARY KEY(p, c));
INSERT INTO t (p, c) VALUES(1, 1);
INSERT INTO t (p, s) VALUES(1, 1) IF NOT EXISTS;
In Cassandra (and Scylla), IF NOT EXISTS evaluates to TRUE, even though both
the regular row and the partition exist. But the static cells are not
set, and the insert only provides a partition key, so the database assumes the
insert is operating against a static row.
It would be wrong to assume that a static row exists when the partition
key exists:
INSERT INTO t (p, c, s) VALUES(1, 1, 1) IF NOT EXISTS;
[applied] | p | c | s
-----------+---+---+------
False | 1 | 1 | null
evaluates to False, i.e. the regular row does exist when p and c exist.
Issue
CREATE TABLE t (p INT, c INT, r INT, s INT static, PRIMARY KEY(p, c));
INSERT INTO t (p, s) VALUES (1, 1);
UPDATE t SET s=2, r=1 WHERE p=1 AND c=1 IF s=1 and r=null;
- in this case, even though the regular row doesn't exist, the static
row does, and should be used for condition evaluation.
In other words, IF EXISTS/IF NOT EXISTS have contextual semantics.
They apply to the regular row if clustering key is used in the WHERE
clause, otherwise they apply to static row.
One analogy for static rows is that it is like a static member of C++ or
Java class. It's an attribute of the class (assuming class = partition),
which is accessible through every object of the class (object = regular
row). It is also present if there are no objects of the class, but the
class itself exists: i.e. a partition could have no regular rows, but
some static cells set, in this case it has a static row.
*Unlike C++/Java static class members* a static row is an optional
attribute of the partition. A partition may exist, but the static row
may be absent (e.g. no static cell is set). If the static row does exist,
all regular rows share its contents, *even if they do not exist*.
A regular row exists when its clustering key is present
in the table. A static row exists when at least one static cell is set.
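
These existence rules can be restated as a small Python model. It is a
sketch of the semantics described in this message only; the class and
method names are hypothetical and nothing here mirrors the actual LWT
code.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionSketch:
    static_cells: dict = field(default_factory=dict)  # column -> value
    rows: dict = field(default_factory=dict)          # clustering key -> cells

    def static_row_exists(self):
        # A static row exists when at least one static cell is set.
        return any(value is not None for value in self.static_cells.values())

    def regular_row_exists(self, ck):
        # A regular row exists when its clustering key is present.
        return ck in self.rows

    def exists(self, ck=None):
        # IF EXISTS / IF NOT EXISTS apply to the regular row when a
        # clustering key is given, otherwise to the static row.
        if ck is not None:
            return self.regular_row_exists(ck)
        return self.static_row_exists()

    def row_for_condition(self, ck):
        # For other IF conditions the static cells are visible even if
        # the regular row is missing; absent columns read as null (None).
        cells = dict(self.rows.get(ck, {}))
        cells.update(self.static_cells)
        return cells
```

In the UPDATE example above, the partition has s=1 and no regular row,
so row_for_condition(1) yields s=1 with r unset, and the condition
"s=1 and r=null" holds.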
Tests are updated because now when no matching row is found
for the update we show the value of the static row as the previous
value, instead of a non-matching clustering row.
Changes in v2:
- reworded the commit message
- added select tests
Closes #10711
Our simple test for BatchGetItem on a table with sort keys still has
requests with just one sort key per partition, so if BatchGetItem has
a bug with requesting multiple sort keys from the same partition,
such a bug won't be caught by the simple tests. So in this test we add a
test that does. This will be useful for the next patch, we are planning
to refactor BatchGetItem's handling of multiple sort keys in the same
partition - so it will be useful to have more regression tests.
The tests test_batch_get_item_large and test_batch_get_item_partial
would actually also catch such bugs, but they are more elaborate tests
and it's nice to have smaller tests more focused on checking specific
features.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is to make it constructible in a way most other services are -- all
the "scalar" parameters are passed via a config.
With this it will be much shorter to add compaction bandwidth throttling
option by just extending the config itself, not the list of constructor
arguments (and all its callers).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make struct scheduling_group be sub-class of the backlog controller. Its
new meaning is now -- the group under controller maintenance. Both
database and compaction manager derive their sched groups from this one.
This makes backlog controller construction simpler, prepares the ground
for sched groups unification in seastar and facilitates next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch that made the same for compaction manager. The
newly introduced private scheduling_group class is temporary and will go
away in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is mainly to make next patch simpler. Also this makes the backlog
controller API smaller by removing its sg() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only difference between those two are in the way backlog controller
is created. It's much simpler to have the controller construction logic
in compaction manager instead. Similar "trick" is used to construct
flush controller for the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This way it's more compact and easier to extend.
Also it's small enough to fix indentation right at once.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it the future-returning method and setup the _stop_future in its
only caller. Makes next patch much simpler
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch set
- adds log before and after batch log replay
- removes a duplicated call to trigger batch log replay
- removes obsolete log
Closes #10800
* github.com:scylladb/scylla:
storage_service: Remove obsolete log
storage_service: Do not call do_batch_log_replay again in unbootstrap
storage_service: Add log for start and stop of batchlog replay
A pre-scrub view snapshot cannot be attributed to user error, so no
call to bail out.
Closes #10760.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10783
* github.com:scylladb/scylla:
api-doc: correct spelling
allow pre-scrub snapshots of materialized views and secondary indices
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes #652.
Closes #10807
* github.com:scylladb/scylla:
memtable: Add counters for tombstone compaction
memtable, cache: Eagerly compact data with tombstones
memtable: Subtract from flushed memory when cleaning
mvcc: Introduce apply_resume to hold state for partition version merging
test: mutation: Compare against compacted mutations
compacting_reader: Drop irrelevant tombstones
mutation_partition: Extract deletable_row::compact_and_expire()
mvcc: Apply mutations in memtable with preemption enabled
test: memtable: Make failed_flush_prevents_writes() immune to background merging
This patch adds an extensive array of tests for the Cassandra feature
that Scylla hasn't implemented yet (issues #2962, #8745, #10707) of
indexing the keys, values or entries of a collection column.
The goal of these tests is to explicitly exercise every corner case
I could think of by looking at the documentation of this feature and
considering its possible implementation - and as usual, making sure
that the tests actually pass on Cassandra.
These tests overlap some of the existing unit tests that we translated
from Cassandra, as well as some randomized tests that do not necessarily
cover the same edge cases as these tests cover.
All tests added in this patch pass on Cassandra, but currently fail
on Scylla due to the above issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10771
The most time-consuming part is invoking "ninja -t compdb", and there
is no need to repeat that for every mode.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10733
`sstable_directory::process_sstables_dir` may hit an exception when calling `handle_component`.
In this case we currently destroy the `sstable_dir_lister` variable without closing the `directory_lister` first -
leading to terminate in `~directory_lister` as seen in #10697.
This mini-series handles this exception and always closes the `directory_lister`.
Add unit test to reproduce this issue.
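The close-on-error pattern described above can be illustrated with a small Python sketch. The class and function below are hypothetical stand-ins, not the actual C++ code: they only show a lister being closed even when the per-entry handler throws.

```python
class DirectoryLister:
    """Hypothetical stand-in for an object that must be closed before destruction."""

    def __init__(self, entries):
        self._entries = list(entries)
        self.closed = False

    def __iter__(self):
        return iter(self._entries)

    def close(self):
        self.closed = True


def process_sstables_dir(lister, handle_component):
    try:
        for entry in lister:
            handle_component(entry)  # may raise
    finally:
        # Always runs, including when handle_component raises midway
        # and the remaining entries are never consumed.
        lister.close()
```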
Fixes #10697
Closes #10754
* github.com:scylladb/scylla:
sstable_directory: process_sstable_dir: fixup indentation
sstable_directory: process_sstable_dir: close directory_lister on error
Before the change, the test artificially set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
Fixes #10801
Refs #10793
(cherry picked from commit 0e78ad50ea)
Closes #10802
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table results in smp::count writes to
the system.local table with a flush after each write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
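A rough Python sketch of this "block the producer" idea follows. It uses OS threads and `threading.Condition` purely as a stand-in for the cooperative condition variable the changelog mentions; the `FlushThrottle` class and its method names are hypothetical, and the threshold follows the max(max_compaction_threshold, 32) rule with a <= comparison from the v3 notes.

```python
import threading
from typing import Optional


class FlushThrottle:
    """Hypothetical stand-in for the per-table wait on sstable run count."""

    def __init__(self, max_compaction_threshold: int):
        self.threshold = max(max_compaction_threshold, 32)
        self.sstable_runs = 0
        self._changed = threading.Condition()

    def on_sstable_set_refresh(self, sstable_runs: int) -> None:
        # Signalled whenever the sstable set changes (e.g. compaction, truncate).
        with self._changed:
            self.sstable_runs = sstable_runs
            self._changed.notify_all()

    def wait_for_room(self, timeout: Optional[float] = None) -> bool:
        # The predicate is re-checked after every wake-up, so a spurious
        # wake-up cannot let the flush proceed early.
        with self._changed:
            return self._changed.wait_for(
                lambda: self.sstable_runs <= self.threshold, timeout)
```

A flush would call `wait_for_room()` before writing a new sstable, and whatever refreshes the sstable set would call `on_sstable_set_refresh()`.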
Fixes https://github.com/scylladb/scylla/issues/4116
Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop
v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset
v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=
v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.
Closes #10717
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
random_mutation_generator: Add option to specify ks_name and cf_name
table: Prevent creating unbounded number of sstables
Otherwise, if we don't consume all lister's entries,
~directory_lister terminates since the
directory_lister is destroyed without being closed.
Add unit test to reproduce this issue.
Fixes #10697
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes #652.
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.
This will start happening after tombstones are compacted with rows on
partition version merging.
This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purely within versions. Consider applying a partition tombstone over
large number of rows.
This patch introduces the apply_resume object to hold the state necessary to
ensure forward progress in case of preemption.
No change in behavior yet.
Memtables and cache will compact eagerly, so tests should not expect
readers to produce exact mutations written, only those which are
equivalent after applying compaction.
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.
Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
Prerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, apply path to the memtable was not
preemptible.
Because merging can now be deferred, we need to involve snapshots to
kick-off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
Before the change, the test artificially set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table results in smp::count writes to
the system.local table with a flush after each write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
The series fixes a couple of crashes that were found during starting and
stopping Scylla with raft while doing ddl operations. Most of them
related to shutdown order between different components.
Also in scylla-dev gleb/group0-fixes-v1
CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/
* origin-dev/gleb/group0-fixes-v1:
migration manager: remove unused code
db/system_distributed_keyspace: do not announce empty schema
main: stop raft before the migration manager
storage_service: do not pass the raft group manager to storage_service constructor
main: destroy the group0_client after stopping the group0
Previously, any attempt to take a materialized view or secondary index
snapshot was considered a mistake and caused the snapshot operation to
abort, with a suggestion to snapshot the base table instead.
But an automatic pre-scrub snapshot of a view cannot be attributed to
user error, so the operation should not be aborted in that case.
(It is an open question whether the more correct thing to do during
pre-scrub snapshot would be to silently ignore views. Or perhaps they
should be ignored in all cases except when the user explicitly asks to
snapshot them, by name)
Closes #10760.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
An expr::constant is an expression that happens to represent a constant,
so it's too heavyweight to be used for evaluation. Right now the extra
weight is just a type (which causes extra work by having to maintain
the shared_ptr reference count), but it will grow in the future to include
source location (for error reporting) and maybe other things.
Prior to e9b6171b5 ("Merge 'cql3: expr: unify left-hand-side and
right-hand-side of binary_operator prepares' from Avi Kivity"), we had
to use expr::constant since there was not enough type information in
expressions. But now every expression carries its type (in programming
language terms, expressions are now statically typed), so carrying types
in values is not needed.
So change evaluate() to return cql3::raw_value. The majority of the
patch just changes that. The rest deals with some fallout:
- cql3::raw_value gains a view() helper to convert to a raw_value_view,
and is_null_or_unset() to match with expr::constant and reduce further
churn.
- some helpers that worked on expr::constant and now receive a
raw_value now need the type passed via an additional argument. The
type is computed from the expression by the caller.
- many type checks during expression evaluation were dropped. This is
a consequence of static typing - we must trust the expression prepare
phase to perform full type checking since values no longer carry type
information.
Closes #10797
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.
Fixes #10780
Reopens #652.
active_memtable().empty() becomes true once seal_active_memtable
succeeds with _memtables->add_memtable(), not when it is able
to flush the (once active) memtable.
In contrast, min_memtable_timestamp() returns api::max_timestamp
only if there is no data in any memtable.
Fixes #10793
Backport notes:
- Introduced in f6d9d6175f (currently in
branch-5.0)
- backport requires also 0e78ad50ea
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10798
and LSIs' from Nadav Har'El
This series includes three small fixes (and of course, tests) for
various edge cases of GSI and LSI handling in Alternator:
1. We add the IndexArn that was missing in DescribeTable for indexes
(GSI and LSI)
2. We forbid the same name to be used for both GSI and LSI (allowing it
was a bug, not a feature)
3. We improve the error handling when trying to tag a GSI or LSI, which
is not currently allowed (it's also not allowed in DynamoDB).
Closes #10791
* github.com:scylladb/scylla:
alternator: improve error handling when trying to tag a GSI or LSI
alternator: forbid duplicate index (LSI and GSI) names
alternator: add ARN for indexes (LSI and GSI)
The left-hand-side of a binary_operator is currently evaluated via
a get_value() function that receives the row values. On the other hand,
the right hand side is evaluated via evaluate(), which receives query_options
in order to resolve bind variables.
This series unifies the two paths into evaluate(), and standardizes the different
inputs into a new evaluation_inputs struct. The old hacks column_value_eval_bag
and column_maybe_subscripted are removed.
Closes #10782
* github.com:scylladb/scylla:
cql3: expr: drop column_maybe_subscripted
cql3: expr: possible_lhs_values(): open-code get_value_comparator()
cql3: expr: rationalize lhs/rhs argument order
cql3: expr: don't rely on grammar when comparing tuples
cql3: expr: wire column_value and subscript to evaluate()
cql3: get_value(subscript): remove gratuitous pointer
cql3: expr: reindent get_value(subscript)
cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted)
cql3: raw_value: add missing conversion from managed_bytes_opt&&
cql3: prepare_expr: prepare subscript type
cql3: expr: drop internal 'column_value_eval_bag'
cql3: expr: change evalute() to accept evaluation_inputs
cql3: expr: make evaluate(<expression subtype>) static
cql3: expr: push is_satisfied_by regular and static column extraction to callers
cql3: expr: convert is_satisfied_by() signature to evaluation_inputs
cql3: expr: introduce evaluation_inputs
In issue #10786, we raised the idea of maybe allowing to tag (with
TagResource) GSIs and LSIs, not just base tables. However, currently,
neither DynamoDB nor Scylla allows it. So in this patch we add a
test that confirms this. And while at it, we fix Alternator to
return the same error message as DynamoDB in this case.
Refs #10786.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exist only one of them (the GSI)
would actually be usable. DynamoDB also forbids such a duplicate name.
So in this patch we add a test for this issue, and fix it.
Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.
Fixes #10789
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These
look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base
table, and INDEXNAME is the name of the LSI or the GSI.
These ARNs should be returned by DescribeTable as part of its
description of each index, and this patch adds that missing IndexArn
field.
The ARN we're adding here is hardly useful (e.g., as explained in
issue #10786, it can't be used to add tags to the index table),
but nevertheless should exist for compatibility with DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Older versions of the Python Cassandra driver had a bug where a single empty
page aborts a scan.
The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately
tiny pages, so it turns out that some of them are empty, so the test breaks on buggy
versions of the driver, which cause the test to fail when run by developers who happen
to have old versions of the driver.
So in this small series we skip this test when running on a buggy version of the driver.
Fixes #10763
Closes #10766
* github.com:scylladb/scylla:
test/cql-pytest: skip another test on older, buggy, drivers
test/cql-pytest: de-duplicate code checking for an old buggy driver
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.
There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
Changelog:
v3:
- explicitly call deregister instead of moving the weight RAII object to release weight
- mark compaction as finished when sstables are compacted, without waiting for history to update
v2:
- Split the patches differently for easier review
- Rebased against newer master, which contains fixes that failed the debug version of the test
- Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717)
Closes #10507
* github.com:scylladb/scylla:
compaction: Release compaction weight before updating history.
compaction: Inline compact_sstables_and_update_history call.
compaction: Extract compact_sstables function
compaction: Rename compact_sstables to compact_sstables_and_update_history
compaction: Extract update_history function
compaction: Extract should_update_history function.
compaction: Fetch start_size from compaction_result
compaction: Add tracking start_size in compaction_result.
On 48b6aec16a we mistakenly allowed
check=True on systemd_unit.is_active(), it should be check=False.
We check unit's status by "systemctl is-active" output string,
it returns "active" or "inactive".
But the systemctl command returns a non-zero status when it reports
"inactive", so we get an exception here.
To fix this, we need new option "ignore_error=True" for out(),
and use it in systemd_unit.is_active().
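The behaviour can be sketched as follows. This is a simplified illustration, not the actual helper code: the signatures of `out()` and `is_active()` here are assumptions, and the `runner` parameter is a hypothetical hook that lets the sketch run on a machine without systemd.

```python
import subprocess
import sys


def out(cmd, ignore_error=False):
    # With check=True, a non-zero exit status raises CalledProcessError;
    # ignore_error=True turns that off and just returns the output.
    result = subprocess.run(cmd, check=not ignore_error,
                            capture_output=True, text=True)
    return result.stdout.strip()


def is_active(unit, runner=None):
    # `runner` is a hypothetical override used only for demonstration.
    cmd = runner or ["systemctl", "is-active", unit]
    return out(cmd, ignore_error=True)


# Simulates "systemctl is-active" for a stopped unit:
# prints "inactive" and exits with a non-zero status.
fake = [sys.executable, "-c", "import sys; print('inactive'); sys.exit(3)"]
```

With `ignore_error=True` the caller gets the string "inactive" back instead of an exception, which is what a status check needs.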
Fixes #10455
Closes #10467
After 93b765f655, our pull_github_pr.sh script tries to detect
a non-orthodox remote repo name, but it also adds an assumption
which breaks on some configurations (by some I mean mine).
Namely, the script tries to parse the repo name from the upstream
branch, assuming that current HEAD actually points to a branch,
which is not the way some users (by some I mean me) work with
remote repositories. Therefore, to make the script also work
with detached HEAD, it now has two fallback mechanisms:
1. If parsing @{upstream} failed, the script tries to parse
master@{upstream}, under the assumption that the master branch
was at least once used to track the remote repo.
2. If that fails, `origin/master` is used as last resort solution.
This patch allows some users (guess who) to get back to using
scripts/pull_github_pr.sh again without using a custom patched version.
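The fallback order can be sketched in Python (the real script is shell; the function names below are hypothetical). The lookup is injectable so the chain can be exercised without a git checkout.

```python
import subprocess


def git_upstream(ref):
    # Returns something like "origin/master" if `ref` resolves, else None.
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "--symbolic-full-name", ref],
        capture_output=True, text=True)
    name = result.stdout.strip()
    return name if result.returncode == 0 and name else None


def resolve_remote_branch(lookup=git_upstream):
    # 1. upstream of the current branch (fails on a detached HEAD),
    # 2. upstream of the local master branch,
    # 3. origin/master as the last resort.
    return (lookup("@{upstream}")
            or lookup("master@{upstream}")
            or "origin/master")
```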
Closes #10773
Clang up to version 13 supports the coroutines technical specification
(in std::experimental). 15 and above support standard coroutines (in
namespace std). Clang 14 supports both, but with a warning for the
technical specification coroutines.
To avoid the warning, change the threshold for selecting standard
coroutines from clang 15 to clang 14. This follows seastar commit
070ab101e2.
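As a sketch of the selection rule after this change (the function name is hypothetical; the real check lives in the build configuration, not in Python):

```python
def coroutine_namespace(clang_major: int) -> str:
    # Threshold after this change is 14 (previously 15): clang 14 supports
    # both flavours but warns about the technical-specification one.
    return "std" if clang_major >= 14 else "std::experimental"
```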
Closes #10647
column_maybe_subscripted is a variant<column_value*, subscript*> that
existed for two reasons:
1. evaluation of subscripts and of columns took different paths.
2. calculation of the type of column or column[sub] took different paths.
Now that all evaluations go through evaluate(), and the types are
present in the expression itself, there is no need for column_maybe_subscripted
and it is replaced with plain expressions.
Some functions accept the right-hand-side as the first argument
and the left-hand-side as the second argument. This is now confusing,
but at least safe-ish, as the arguments have different types. It's
going to become dangerous when we switch to expressions for both sides,
so let's rationalize it by always starting with lhs.
Some parameters were annotated with _lhs/_rhs when it was not clear.
The grammar only allows comparing tuples of clustering columns, which
are non-null, but let's not rely on that deep in expression evaluation
as it can be relaxed.
is_satisfied_by() used an internal column_value_eval_bag type that
was more awkwardly named (and more awkward to use due to more nesting)
than evaluation_inputs. Drop it and use evaluation_inputs throughout.
The thunk is_satisfied_by(evaluation_inputs) that just called
is_satisfied_by(column_value_eval_bag) is dropped.
Currently, evaluate() accepts only query_options, which makes
it not useful to evaluate columns. As a result some callers
(column_condition) have to call it directly on the right-hand-side
of binary expressions instead of evaluating the binary expression
itself.
Change it to accept evaluation_inputs as a parameter, but keep
the old signature too, since it is called from many places that
don't have rows.
is_satisfied_by() rearranges the static and regular columns from
query::result_row_view form (which is a use-once iterator) to
std::vector<managed_bytes_opt> (which uses the standard value
representation, and allows random access which expression
evaluation needs). Doing it in is_satisfied_by() means that it is
done every time an expression is evaluated, which is wasteful. It's
also done even if the expression doesn't need it at all.
Push it out to callers, which already eliminates some calls.
We still pass cql3::expr::selection, which is a layering violation,
but that is left to another time.
Note that in view.cc's check_if_matches(), we should have been
able to move static_and_regular_columns calculation outside the
loop. However, we get crashes if we do. This is likely due to a
preexisting bug (which the zero iterations loop avoids). However,
in selection.cc, we are able to avoid the computation when the code
claims it is only handling partition keys or clustering keys.
Callers are converted, but the internals are kept using the old
conventions until more APIs are converted.
Although the new API allows passing no query_options, the view code
keeps passing dummy query_options and improvement is left as a FIXME.
An expression may refer to values provided externally: the partition
and clustering keys, the static and regular row (all providing
column values), and the query options (providing values for bind
variables). Currently, different evaluation functions
(evaluate(), get_value(), and is_satisfied_by()) receive different
subsets of these values.
As a first step towards unifying the various ways to evaluate an
expression, collect the parameters in a single structure. Since
different evaluation contexts have different subsets, make everything
optional (via a pointer). Note that callers are expected to verify
using the grammar or prepare phase that they don't refer to values
that are not provided.
The cql3::selection::selection parameter is provided to translate
from query::result_row_view to schema column indexes. This is pretty
bad since it means the translation needs to be done for every
evaluation and is therefore a candidate for removal, but is kept here
since that's how it's currently done.
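The shape of such an "everything optional" input bundle can be sketched in Python. The real structure is C++ and uses pointers; the field names and the `bind_variable` helper below are illustrative assumptions, not the actual definition.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class EvaluationInputs:
    # Every field is optional: each evaluation context supplies only a subset.
    partition_key: Optional[Sequence[bytes]] = None
    clustering_key: Optional[Sequence[bytes]] = None
    static_and_regular_columns: Optional[Sequence[Optional[bytes]]] = None
    selection: Optional[Any] = None  # maps result-row positions to columns
    options: Optional[Any] = None    # provides values for bind variables


def bind_variable(inputs: EvaluationInputs, index: int):
    # Callers are expected to have verified earlier (grammar or prepare
    # phase) that the expression only refers to values that are provided.
    if inputs.options is None:
        raise ValueError("query options were not provided")
    return inputs.options[index]
```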
Marking a test as flaky allows it to keep running in CI, rather than being disabled, once it is discovered to be flaky.
Flaky tests, if they fail, show up as flaky in the output, but don't fail the CI.
```
kostja@hulk:~/work/scylla/scylla$ ./test.py cdc_with --repeat=30 --verbose
Found 30 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/30] cql debug [ FLKY ] cdc_with_lwt_test.2 9.36s
[2/30] cql debug [ FLKY ] cdc_with_lwt_test.1 9.53s
[3/30] cql debug [ PASS ] cdc_with_lwt_test.7 9.37s
[4/30] cql debug [ PASS ] cdc_with_lwt_test.8 9.41s
[5/30] cql debug [ PASS ] cdc_with_lwt_test.10 9.76s
[6/30] cql debug [ FLKY ] cdc_with_lwt_test.9 9.71s
```
Closes #10721
* github.com:scylladb/scylla:
test.py: add support for flaky tests
test.py: make Test hierarchy resettable
test.py: proper suite name in the log
test.py: shutdown cassandra-python connection before exit
The idea is that a flaky test can be marked as flaky
rather than disabled to make sure it passes in CI.
This reduces chances of a regression being added
while the flakiness is being resolved and the number
of disabled tests doesn't grow.
Introduce reset() hierarchy, which is similar to __init__(),
i.e. allows to reset test execution state before retrying it.
Useful for retrying flaky tests.
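A minimal sketch of how a flaky marker and reset() could fit together. This is not the actual test.py code: the class layout, the single retry, and the exact reporting rule are assumptions made for illustration.

```python
class Test:
    def __init__(self, name, body, is_flaky=False):
        self.name = name
        self.body = body          # callable returning True on success
        self.is_flaky = is_flaky
        self.reset()

    def reset(self):
        # Like __init__(), but clears only execution state before a retry.
        self.success = False

    def execute(self):
        self.success = bool(self.body())


def run(test, max_retries=1):
    """Returns (label, fails_ci)."""
    test.execute()
    retries = 0
    while not test.success and test.is_flaky and retries < max_retries:
        test.reset()
        test.execute()
        retries += 1
    if test.is_flaky and retries > 0:
        return "FLKY", False  # shown as flaky, does not fail the CI
    return ("PASS", False) if test.success else ("FAIL", True)
```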
Use a nice suite name rather than an internal Python
object key in the log. Fixes a regression introduced
when addressing a style-related review remark.
Shutdown cassandra-python connections before exit, to avoid
warnings/exceptions at shutdown.
Cassandra-python runs a thread pool and if
connections are not shut down before exit, there could
be a warning that the thread pool is not destroyed
before exiting main.
CDC tables use a custom partitioner, which is not reflected in schema dumps (`CREATE TABLE ...`) and currently it is not possible to fix this properly, as we have no syntax to set the partitioner for a table. To work around this, the schema loader determines whether a table is a cdc table based on its name (does it end with `_scylla_cdc_table`) and sets the partitioner manually if it is the case.
Fixes: https://github.com/scylladb/scylla/issues/9840
Closes #10774
* github.com:scylladb/scylla:
tools/schema_loader: add support for CDC tables
cdc/log.hh: expose is_log_name()
Change port type passed to Cassandra Python driver to int to avoid format errors in exceptions.
Manually shutdown connections to avoid reconnects after tests are done (required by upcoming async pytests).
Tests: (dev)
Closes #10722
* github.com:scylladb/scylla:
test.py: shutdown connection manually
test.py: fix port type passed to Cassandra driver
CDC tables use a custom partitioner, which is not reflected in schema
dumps (`CREATE TABLE ...`) and currently it is not possible to fix this
properly, as we have no syntax to set the partitioner for a table.
To work around this, the schema loader determines whether a table is a
cdc table based on its name (does it end with `_scylla_cdc_table`) and
sets the partitioner manually if it is the case.
Allow outside code to use it to determine whether a table is cdc or not.
This is currently the most reliable method if the custom partitioner is
not set on the schema of the investigated table.
"
This series is a consequence of the work started by:
"compaction: LCS: Fix inefficiency when pushing SSTables to higher levels"
9de7abdc80
"Redefine Compaction Backlog to tame compaction aggressiveness" d8833de3bb
The backlog definition for leveled is incorrectly built on the assumption that
the world must reach the state of zero amplification, i.e. everything in the
last level. The actual goal is space amplification of 1.1.
In reality, LCS just wants that for every level L, level L is fan_out=10 times
larger than L-1. See more in commit 9de7abdc80 which adjusts LCS to conform
to this goal.
If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should
return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1
But today, LCS calculates high backlog for the layout above, as it will only be
satisfied once everything is promoted to the maximum level. That's completely
disconnected from what the strategy actually wants. Therefore, a mismatch.
With today's definition, the backlog for any SSTable is:
sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out
where Lmax = maximum level,
and fan_out = LCS' fan out which is 10 by default
That's essentially calculating the total cost for data in the SSTable to climb
up to the maximum level. Of course, if a SSTable is at the maximum level,
(Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero.
Take a look at this example:
If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G
0.16G = LCS' default fragment size
Maximum level (Lmax in formula) can be easily 3 as:
log10 of (30G/0.16G = ~187 sstables) = ~2.27
~2.27 means that data has exceeded level 2 capacity and so needs 3 levels.
So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk
memory ratio), that's normalized backlog of ~15, which translates into
additional ~500 shares. That's halfway to full compaction speed.
With more files in higher levels, we can easily get to a normalized backlog
above 30, resulting in 1k shares.
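The arithmetic above can be reproduced with a small sketch. This is not the
code in the tree; the function names are made up for illustration, and each
level is treated as one aggregate SSTable for simplicity.

```python
import math

FAN_OUT = 10  # LCS default fan-out, as stated in the text


def old_backlog(sstable_size_gb, level, max_level, fan_out=FAN_OUT):
    """Backlog of one SSTable under the old definition quoted above."""
    return sstable_size_gb * (max_level - level) * fan_out


def space_amplification(level_sizes_gb):
    """Total data size divided by the size of the last level."""
    return sum(level_sizes_gb) / level_sizes_gb[-1]


fragment_gb = 0.16  # LCS default fragment size
# 30G / 0.16G is ~187 sstables; log10 of that is ~2.27, so 3 levels are needed.
max_level = math.ceil(math.log10(30 / fragment_gb))

one_l0 = old_backlog(fragment_gb, level=0, max_level=max_level)
three_l0 = 3 * one_l0

# The layout the strategy is actually happy with: L0..L3 = 1G, 10G, 100G, 1000G.
ideal_layout_gb = [1, 10, 100, 1000]
amplification = space_amplification(ideal_layout_gb)
# Under the old definition this satisfied layout still reports a large backlog.
ideal_layout_old_backlog = sum(
    old_backlog(size, level, max_level=3)
    for level, size in enumerate(ideal_layout_gb)
)

print(max_level, one_l0, three_l0, amplification, ideal_layout_old_backlog)
```

The last value is the mismatch described above: the layout has a space
amplification of about 1.1, yet the old formula still charges it for promoting
every level to the top.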
The suboptimal backlog definition causes either table using LCS or coexisting
tables to run with more shares than needed, causing compaction to steal
resources, resulting in higher latency and reduced throughput.
To solve this problem, a new formula is used which will basically calculate
the amount of work needed to achieve the layout goal. We no longer want to
promote everything to the last level, but instead we'll incrementally calculate
the backlog in each level L, which is the amount of work needed such that the
next level L + 1 is at least fan_out times bigger.
Fixes #10583.
Results
=====
image:
https://user-images.githubusercontent.com/1409139/168713675-d5987d09-7011-417c-9f91-70831c069382.png
The patched version correctly clears the backlog, meaning that once LCS is
satisfied, backlog is 0. Therefore, the next compaction, either from this table
or another, won't run unnecessarily aggressively.
p99 read and write latency have clearly improved. Throughput is also more
stable.
"
* 'LCS_backlog_revamp' of https://github.com/raphaelsc/scylla:
tests: sstable_compaction_test: Adjust controller unit test for LCS
compaction: Redefine Leveled compaction backlog
The controller unit test for LCS was only creating level 0 SSTables.
As level 0 falls back to STCS controller, it means that we weren't actually
testing LCS controller.
So let's adjust the unit test to account for LCS fan_out, which is 10
instead of 4, and also allow creation of SSTables on higher levels.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The backlog definition for leveled is incorrectly built on the assumption that
the world must reach the state of zero amplification, i.e. everything in the
last level. The actual goal is space amplification of 1.1.
In reality, LCS just wants that for every level L, level L is fan_out=10 times
larger than L-1. See more in commit 9de7abdc80 which adjusts LCS to conform
to this goal.
If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should
return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1
But today, LCS calculates high backlog for the layout above, as it will only be
satisfied once everything is promoted to the maximum level. That's completely
disconnected from what the strategy actually wants. Therefore, a mismatch.
With today's definition, the backlog for any SSTable is:
sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out
where Lmax = maximum level,
and fan_out = LCS' fan out which is 10 by default
That's essentially calculating the total cost for data in the SSTable to climb
up to the maximum level. Of course, if a SSTable is at the maximum level,
(Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero.
Take a look at this example:
If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G
0.16G = LCS' default fragment size
Maximum level (Lmax in formula) can be easily 3 as:
log10 of (30G/0.16G = ~187 sstables) = ~2.27
~2.27 means that data has exceeded level 2 capacity and so needs 3 levels.
So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk
memory ratio), that's normalized backlog of ~15, which translates into
additional ~500 shares. That's halfway to full compaction speed.
With more files in higher levels, we can easily get to a normalized backlog
above 30, resulting in 1k shares.
The suboptimal backlog definition causes either table using LCS or coexisting
tables to run with more shares than needed, causing compaction to steal
resources, resulting in higher latency and reduced throughput.
To solve this problem, a new formula is used which will basically calculate
the amount of work needed to achieve the layout goal. We no longer want to
promote everything to the last level, but instead we'll incrementally calculate
the backlog in each level L, which is the amount of work needed such that the
next level L + 1 is at least fan_out times bigger.
Fixes #10583.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since commit 3dc9a81d02 (repair: Repair
table by table internally), tables are always repaired one after
another. This means a table is repaired in a continuous manner.
Before that commit, a table would be repaired again only after the other
tables had finished the same range:
```
for range in ranges
    for table in tables
        repair(range, table)
```
With that ordering the wait interval could be large, so we could not assume
that the whole table was finished just because there was no repair traffic.
After commit 3dc9a81d02, we can rely on
the fact that a table is repaired continuously and trigger off-strategy
compaction automatically when no repair traffic for a table is present.
This is especially useful for the decommission operation with multiple
tables. Currently, we only notify the peer node that the decommission is done
and ask the peer to trigger off-strategy compaction. With this
patch, the peer node will trigger it automatically after a table is
finished, reducing the number of temporary sstables on disk.
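To illustrate why the loop order matters, here is a minimal sketch comparing
the two schedules. The helper names and the sample ranges and tables are
hypothetical; the real repair code is not structured like this.

```python
def range_major_order(ranges, tables):
    """Order used before commit 3dc9a81d02: every table for a range, then the next range."""
    return [(table, rng) for rng in ranges for table in tables]


def table_major_order(ranges, tables):
    """Order used since commit 3dc9a81d02: every range for a table, then the next table."""
    return [(table, rng) for table in tables for rng in ranges]


def is_contiguous(schedule):
    """True if each table's repair steps form a single uninterrupted run."""
    seen, previous = set(), None
    for table, _ in schedule:
        if table != previous:
            if table in seen:
                return False
            seen.add(table)
            previous = table
    return True


def last_step_per_table(schedule):
    """Index of the final repair step of each table in the schedule."""
    return {table: index for index, (table, _) in enumerate(schedule)}


ranges = ["r0", "r1", "r2"]
tables = ["a", "b"]
old = range_major_order(ranges, tables)
new = table_major_order(ranges, tables)
print(is_contiguous(old), last_step_per_table(old))
print(is_contiguous(new), last_step_per_table(new))
```

In the table-major schedule a table's last step comes as early as possible, so
a pause in repair traffic for that table is a usable signal that it is done.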
Refs #10462
Closes #10761
messaging_service.hh is a switchboard - it includes many things,
and many things include it. Therefore, changes in the things it
includes affect many translation units.
Reduce the dependencies by forward-declaring as much as possible.
This isn't pretty, but it reduces compile time and recompilations.
Other headers adjusted as needed so everything (including
`ninja dev-headers`) still compiles.
Closes #10755
Currently, preparing the left-hand-side of a binary operator and the
right-hand-side use different code paths. The left-hand-side derives
the type of the expression from the expression itself, while the
right-hand-side imposes the type on the expression (allowing the types
of bind variables to be inferred).
This series unifies the two, by making the imposed type (the "receiver")
optional, and by allowing prepare to fail gracefully if we were not able
to infer the type. The old prepare_binop_lhs() is removed and replaced
with prepare_expression, already used for the right hand side.
There is one step remaining, and that is to replace prepare_binary_operator
with prepare_expression, but that is more involved and is left for a follow-up.
Closes #10709
* github.com:scylladb/scylla:
cql3: expr: drop prepare_binop_lhs()
cql3: expr: move implementation of prepare_binop_lhs() to try_prepare_expression()
cql3: expr: use recursive descent when preparing subscripts
cql3: expr: allow prepare of tuple_constructor with no receiver
cql3: expr: drop no longer used printable_relation parameter from prepare_binop_lhs()
cql3: expr: print only column name when failing to resolve column
cql3: expr: pass schema to prepare_expression
cql3: expr: prepare_binary_operator: drop unused argument ctx
cql3: expr: stub type inference for prepare_expression
cql3: expr: introduce type_of() to fetch the type of an expression
cql3: expr: keep type information in casts
cql3: expr: add type field to subscript, field_selection, and null expressions
cql3: expr: cast: use data_type instead of cql3_type for the prepared form
Older versions of the Python Cassandra driver had a bug, detected by
the driver_bug_1 fixture, where a single empty page aborts a scan.
The test test_secondary_index.py::test_filter_and_limit uses filtering
and deliberately tiny pages, so it turns out that some of them are
empty, so the test breaks on buggy versions of the driver, which causes
the test to fail when run by developers who happen to have old versions
of the driver.
So in this patch we use the driver_bug_1 fixture, to skip this test
when running on a buggy version of the driver.
Fixes #10763
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have in test_filtering.py two tests which fail when running on an old
version of the Python driver which has a specific bug, so we skip those
tests if the buggy driver is installed.
But the code to check the driver version is duplicated twice, so in this
patch we move the version-checking-and-skipping code to a fixture, which
we can use twice.
The motivation is that in the next patch we will want to introduce a
third use of the same code - and a fixture is cleaner than a third
duplicate.
This patch is supposed to be code-movement only, without functional
changes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reduce the storage_service's dependency on the raft group manager. The
group manager is needed only during bootstrap and in an rpc handler, so
pass it to those functions directly.
The group0_client uses the group0 internally and cannot be destroyed
until the group0 is stopped, to guarantee that there are no ongoing calls
into it by the group0_client.
It turns out that DynamoDB forbids requesting the same item more than
once in a BatchGetItem request. Trying to do it would obviously be a
waste, but DynamoDB outright refuses it - and Alternator currently
doesn't (refs #10757).
The test currently passes on DynamoDB and fails on Alternator, so it
is marked xfail.
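A minimal sketch of the kind of check DynamoDB performs, assuming the usual
`RequestItems` shape of a BatchGetItem request. The helper names are
hypothetical, `ValueError` stands in for DynamoDB's validation error, and this
is not Alternator code.

```python
import json


def duplicate_keys(keys):
    """Return the keys that appear more than once in a list of item keys."""
    seen, duplicates = set(), []
    for key in keys:
        # Canonical JSON makes attribute order within a key irrelevant.
        canonical = json.dumps(key, sort_keys=True)
        if canonical in seen:
            duplicates.append(key)
        seen.add(canonical)
    return duplicates


def validate_batch_get(request_items):
    """Reject a request that asks for the same item twice in one table."""
    for table, spec in request_items.items():
        if duplicate_keys(spec["Keys"]):
            raise ValueError(f"duplicate item keys requested for table {table}")


ok_request = {"tbl": {"Keys": [{"p": {"S": "a"}}, {"p": {"S": "b"}}]}}
bad_request = {"tbl": {"Keys": [{"p": {"S": "a"}}, {"p": {"S": "a"}}]}}

validate_batch_get(ok_request)  # passes silently
try:
    validate_batch_get(bad_request)
except ValueError as error:
    print("rejected:", error)
```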
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10758
As seen in https://github.com/scylladb/scylla/issues/10738,
compaction::setup might stall when processing a large number of sstables.
Make it a coroutine and maybe_yield to prevent those stalls.
Closes #10750
* github.com:scylladb/scylla:
compaction: setup: reserve space for _input_sstable_generations
compaction: coroutinize setup and maybe yield
This series converts try/catch blocks in coroutines for multishard_mutation_query to use coroutine::as_future to get and handle errors, reducing exception handling costs (that are expected on timeouts).
It was previously sent to the mailing list.
This version (v2) is just a rebase of the v1 series,
with one patch dropped as it was already merged to master independently.
Closes #10727
* github.com:scylladb/scylla:
multishard_mutation_query: do_query: couroutinize save_readers lambda
multishard_mutation_query: do_query: prevent exceptions using coroutine::as_future
multishard_mutation_query: read_page: prevent exceptions using coroutine::as_future
multishard_mutation_query: save_readers: fixup indentation
multishard_mutation_query: coroutinize save_readers
multishard_mutation_query: lookup_readers: make noexcept
multishard_mutation_query: optimize lookup_readers
We know in advance the maximum number of
sstable generations to track, so reserve space for it
to prevent vector reallocation for large number of sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Optimize error handling by preventing
exception try/catch using coroutine::as_future
to get query::consume_page's result.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use smp::invoke_on_all rather than a home-brewed
version of parallel_for_each over all shard ids.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So it can be co_awaited efficiently using coroutine::as_future,
otherwise, any exceptions will escape `as_future`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to call _db.invoke_on inside a parallel_for_each
loop over all shards. Just use _db.invoke_on_all instead.
Besides that, there's no need for a .then continuation
for assigning the per-shard reader in _readers[shard].
It can be done by the functor running on each db shard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Found with scylla --blocked-reactor-notify-ms 1 during a replace operation with rbno turned on.
The stalls seen without this patch set were gone with it applied.
Closes #10737
* github.com:scylladb/scylla:
repair: Avoid stall in working_row_hashes
repair: Avoid stall in apply_rows_on_master_in_thread
* seastar 2be9677d6e...1424d34c93 (22):
> Use tls socket to retrieve distinguished name
> perftune.py: remove duplicates in 'append' parameters when we dump an options file
> rpc: add an option for an asynchronous connection isolation function
> Merge "Add more facilities to RPC tester" from Pavel E
> json: wait for writing final characters of a json document
> Revert "Use tls socket to retrieve distinguished name"
> future.hh: drop unused parameters
> core/scollected: initialize _buf explicitly
> rpc: remove recursion in do_unmarshall()
> coroutine: Fix generator clang compilation
> core: Reduce the default blocked-reactor-notify-ms to 25ms
> build: group "CMAKE_CXX_*" options together
> doc: s/c++dialect/c++-standard/
> test: coroutines: adjust coroutine generator test for gcc
> Use tls socket to retrieve distinguished name
> coroutine: add an async generator
> net/api: s/server_socket::is_listening()/operator bool()/
> net/api: let "server_socket::local_address()" always return an addr
> tls_test: Remove unsupported prio string from test case
> Merge 'abort_source: assert request_abort called exactly once' from Benny Halevy
> coroutines/all: stop using std::aligned_union_t
> coroutines/all: ensure the template argument deduction work with clang-15
Closes #10739
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.
There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
Compaction is marked as finished as soon as sstables are compacted
(without waiting for history update).
The issue is about handling errors when the user specifies something strange instead of a type, e.g. CREATE TABLE try1 (a int PRIMARY KEY, b list<zzz>):
* the error message only talks about collections, while zzz could also be a UDT;
* the same error message is given even when zzz is not a valid collection or UDT name.
The first point has already been fixed, now Scylla says 'Non-frozen user types or collections are not allowed inside collections: list<zzz>'. This commit fixes the second.
Whether the type is a valid UDT or not is checked in cql3_type::raw_ut::prepare_internal, but 'non-frozen' check triggers first in cql3_type::raw_collection::prepare_internal, before we recursively get to the argument types of the collection. The patch reverses the order here, first thing we recurse and ensure that the collection argument types are valid, and only then we apply the collection checks. A side effect of this is that the error messages of the checks in raw_collection will include the keyspace name, because it will now be assigned in raw_ut::prepare_internal before them.
The patch affects the validation order, so in case of list<zzz<xxx>> the message could be different, but it doesn't seem to be possible according to the CQL grammar.
Examples:
create type ut2 (a int, b list<ut1>); --> error('Unknown type ks.ut1')
create type ut1 (a int);
create type ut2 (a int, b list<ut1>); --> error('Non-frozen user types or collections are not allowed inside collections: list<ks.ut1>')
create type ut2 (a int, b list<frozen<ut1>>); --> OK
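The new validation order can be modelled with a toy sketch. The classes and
`prepare()` below are hypothetical stand-ins, not the cql3_type code; they
only show that recursing into the element type first yields the messages in
the examples above.

```python
from dataclasses import dataclass


class InvalidRequest(Exception):
    pass


@dataclass
class Native:
    name: str


@dataclass
class UserType:
    name: str
    frozen: bool = False


@dataclass
class ListOf:
    element: object
    frozen: bool = False


def render(t, keyspace):
    """Text form of a type, with UDT names qualified by the keyspace."""
    if isinstance(t, Native):
        return t.name
    if isinstance(t, UserType):
        text = f"{keyspace}.{t.name}"
    else:
        text = f"list<{render(t.element, keyspace)}>"
    return f"frozen<{text}>" if t.frozen else text


def prepare(t, keyspace, known_udts):
    """Validate a type: element types first, collection checks second."""
    if isinstance(t, Native):
        return
    if isinstance(t, UserType):
        if t.name not in known_udts:
            raise InvalidRequest(f"Unknown type {keyspace}.{t.name}")
        return
    prepare(t.element, keyspace, known_udts)  # recurse before the frozen check
    if isinstance(t.element, (UserType, ListOf)) and not t.element.frozen:
        raise InvalidRequest(
            "Non-frozen user types or collections are not allowed inside "
            "collections: " + render(t, keyspace)
        )


def error_of(t, known_udts):
    """Return the error message for a type, or None if it validates."""
    try:
        prepare(t, "ks", known_udts)
    except InvalidRequest as error:
        return str(error)
    return None


print(error_of(ListOf(UserType("ut1")), known_udts=set()))
print(error_of(ListOf(UserType("ut1")), known_udts={"ut1"}))
print(error_of(ListOf(UserType("ut1", frozen=True)), known_udts={"ut1"}))
```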
Fixes: scylladb#3541
Closes #10726
In 69af7a830b ("tools: toolchain: prepare: build arch images in parallel"),
we added parallel image generation. But it turns out that buildah can
do this natively (with the --platform option to specify architectures
and --jobs parameter to allow parallelism). This is simpler and likely
has better error handling than an ad-hoc bash script, so switch to it.
Closes #10734
The Storage field of "coredumpctl info" changed in systemd v248: it now has
"(present)" appended to the end of the line when the coredump file is available.
Fixes #10669
Closes #10714
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
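The effect on a raft-log-like workload can be shown with a toy model. The
class below is a hypothetical illustration only: the real memtable works on
partition versions and mutation fragments, not on a flat dictionary.

```python
class ToyMemtable:
    """Flat key/value memtable that drops rows as soon as a tombstone covers them."""

    def __init__(self):
        self.rows = {}        # key -> (timestamp, value)
        self.tombstones = {}  # key -> deletion timestamp

    def write(self, key, timestamp, value):
        # A write that is not newer than the tombstone is already dead.
        if key in self.tombstones and self.tombstones[key] >= timestamp:
            return
        current = self.rows.get(key)
        if current is None or current[0] < timestamp:
            self.rows[key] = (timestamp, value)

    def delete(self, key, timestamp):
        self.tombstones[key] = max(timestamp, self.tombstones.get(key, timestamp))
        current = self.rows.get(key)
        # Eager compaction: free the covered row now instead of at flush time.
        if current is not None and current[0] <= timestamp:
            del self.rows[key]


log = ToyMemtable()
for index in range(1, 6):
    log.write(index, timestamp=index, value=f"entry-{index}")  # uncommitted entries
for index in range(1, 5):
    log.delete(index, timestamp=100 + index)                   # entries get committed
print(sorted(log.rows))
```

Only the entry that is still uncommitted keeps occupying row memory, which is
what keeps the live set short and delays the next flush.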
Fixes #652.
Closes #10612
* github.com:scylladb/scylla:
memtable: Add counters for tombstone compaction
memtable, cache: Eagerly compact data with tombstones
memtable: Subtract from flushed memory when cleaning
mvcc: Introduce apply_resume to hold state for partition version merging
test: mutation: Compare against compacted mutations
compacting_reader: Drop irrelevant tombstones
mutation_partition: Extract deletable_row::compact_and_expire()
mvcc: Apply mutations in memtable with preemption enabled
test: memtable: Make failed_flush_prevents_writes() immune to background merging
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes #652.
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.
This will start happening after tombstones are compacted with rows on
partition version merging.
The patch prevents this by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purely within versions. Consider applying a partition tombstone over
a large number of rows.
This patch introduces the apply_resume object to hold the necessary state to
ensure forward progress in case of preemption.
No change in behavior yet.
Memtables and cache will compact eagerly, so tests should not expect
readers to produce the exact mutations written, only those which are
equivalent after applying compaction.
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.
Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
Prerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, the apply path to the memtable was not
preemptible.
Because merging can now be deferred, we need to involve snapshots to
kick off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
Before the change, the test artificially set the soft pressure
condition hoping that the background flusher would flush the
memtable. That won't happen if, by the time the background flusher runs,
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
To prevent async scheduling issues of reconnection after tests are done,
manually close the connection after fixture ends.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
To provide a reasonably-definitive answer to "what exact version of
Scylla wrote this?".
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10712
* github.com:scylladb/scylla:
docs: document recently-added Scylla sstable metadata sections
sstables: save Scylla version & build id in metadata
scylla_sstable: generalize metadata visitor for disk_string
build_id: cache the value
- Introduce a simpler substitute for `flat_mutation_reader`-resulting-from-a-downgrade that is adequate for the remaining uses but is _not_ a full-fledged reader (does not redirect all logic to an `::impl`, does not buffer, does not really have `::peek()`), so hopefully carries a smaller performance overhead. The name `mutation_fragment_v1_stream` is kind of a mouthful but it's the best I have
- (not tests) Use the above instead of `downgrade_to_v1()`
- Plug it in as another option in `mutation_source`, in and out
- (tests) Substitute deliberate uses of `downgrade_to_v1()` with `mutation_fragment_v1_stream()`
- (tests) Replace all the previously-overlooked occurrences of `mutation_source::make_reader()` with `mutation_source::make_reader_v2()`, or with `mutation_source::make_fragment_v1_stream()` where deliberate or still required (see below)
- (tests) This series still leaves some tests with `mutation_fragment_v1_stream` (i.e. at v1) where not called for by the test logic per se, because another missing piece of work is figuring out how to properly feed `mutation_fragment_v2` (i.e. range tombstone changes) to `mutation_partition`. While that is not done (and I think it's better to punt on it in this PR), we have to produce `mutation_fragment` instances in tests that `apply()` them to `mutation_partition`, thus we still use downgraded readers in those tests
- Remove the `flat_mutation_reader` class and things downstream of it
Fixes #10586
Closes #10654
* github.com:scylladb/scylla:
fix "ninja dev-headers"
flat_mutation_reader ist tot
tests: downgrade_to_v1() -> mutation_fragment_v1_stream()
tests: flat_reader_assertions: refactor out match_compacted_mutation()
tests: ms.make_reader() -> ms.make_fragment_v1_stream()
repair/row_level: mutation_fragment_v1_stream() instead of downgrade_to_v1()
stream_transfer_task: mutation_fragment_v1_stream() instead of downgrade_to_v1()
sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1()
mutation_source: add ::make_fragment_v1_stream()
introduce mutation_fragment_v1_stream
tests: ms.make_reader() -> ms.make_reader_v2()
tests: remove test_downgrade_to_v1_clear_buffer()
mutation_source_test: fix indentation
tests: remove some redundant calls to downgrade_to_v1()
tests: remove some to-become-pointless ms.make_reader()-using tests
tests: remove some to-become-pointless reader downgrade tests
The test sometimes fails because the order of rows in the SELECT results
depends on how stream IDs for the different partition keys get generated.
In some runs the stream ID for pk=1 may go before the stream ID for
pk=4, in some runs the other way.
The fix is to use the same partition key but different clustering keys
for the different rows.
Refs: #10601
Closes #10718
Replace:
Compressed chunk checksum mismatch at chunk {}, offset {}, for chunk of size {}: expected={}, actual={}
With:
Compressed chunk checksum mismatch at offset {}, for chunk #{} of size {}: expected={}, actual={}
This is a follow-up for #10693. Also bring the uncompressed chunk
checksum check messages up to date with the compressed one (which #10693
forgot to do).
Another change included is merging the advancement of the chunk index
with the iteration over the chunks, so we don't maintain two counters
(one in the iterator and an explicit one).
Closes #10715
Some metadata fields have interesting types, and some are just
strings. There can be more than one string field, which the visitor
would not be able to distinguish from one another by type alone, so no
reason to make `scylla_metadata::sstable_origin` special.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The CPU cost of iterating over the relevant ELF structures is probably
negligible (despite the amount of code involved), but there is no need
to keep the containing page mapped in RAM when it doesn't have to be.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
To avoid a discrepancy about underlying generation type once something other than integer is allowed for the sstable generation.
Also simplifies one generic writer interface for sealing sstable statistics.
Closes #10703
* github.com:scylladb/scylla:
sstables: Use generation_type for compaction ancestors
sstables: Make compaction ancestors optional when sealing statistics
This unifies the left-hand-side and right-hand-side of expression preparation.
The contents of the visitor in prepare_binop_lhs() is moved to the
visitor in try_prepare_expression(). This usually replaces an
on_internal_error() branch.
An exception is tuple_constructor, which is valid in both the left-hand-side
and right-hand-side (e.g. WHERE (x, y) IN (?, ?, ?)). We previously
enhanced this case to support not having a column_specification, so
we just delete the branch from prepare_binop_lhs.
When encountering a subscript as the left-hand-side of a binary operator,
we assume the subscripted value is a column and process it directly.
As a step towards de-specializing the left-hand-side of binary operators,
use recursive descent into prepare_binop_lhs() instead. This requires
generating a column_specification for arbitrary expressions, so we
add a column_specification_of() function for that. Currently it will
return a good representation for columns (the only input allowed by
the grammar) and a bad representation (the text representation of the
expression) for other expressions. We'll have to improve that when we
relax the grammar.
Currently the only expression form that can appear on both the left
hand side of an expression and the right hand side is a tuple constructor,
so consequently it must support both modes of type processing - either
deriving the type from the expression, or imposing a type on the expression.
As an example, in
WHERE (A, B) = (:a, :b)
the first tuple derives its type from the column types, while the
second tuple has the type of the first tuple imposed on it.
So, we adjust tuple_constructor_prepare_nontuple to support both forms.
This means allowing the receiver not to be present, and calculating the
tuple type if that is the case.
resolve_column() is part of the prepare stage, and tries to
resolve a column name in a query against the table's columns.
If it fails, it prints the containing binary_expression as
context. However, that's unnecessary - the unresolved
column name is sufficient context. So print that.
The motivation is to unify preparation of binary_operator
left-hand-side and right-hand-side - prepare_expression()
doesn't have the extra parameter and it wouldn't make sense
to add it, as expressions might not be children of binary_operators.
Currently prepare_expression is never used where a schema is needed -
it is called for the right-hand-side of binary operators (where we
don't accept columns) or for attributes like WRITETIME or TTL. But
when we unify expression preparation it will need to handle columns
too, and these need the schema to look up the column.
So pass the schema as a parameter. It is optional (a pointer) since
not all contexts will have a schema (for example CREATE AGGREGATE).
In CQL (and SQL) types flow in different directions in expression
components. In an expression
A[:x] = :y
The type of A is known, the type of :x is derived from the type of A,
and the type of :y is derived from the type of A[:x].
Currently prepare_expression() only supports the second mode - an
expression's type is dictated by its caller via the column_specification
parameter. But this means it can only be used to evaluate the
right-hand-side of binary expressions, since the left-hand-side uses
the first mode, where the type is derived from the column, not
imposed by the caller.
To support both modes, make the column_specification parameter optional
(it is already a pointer so just accept null) and also make the returned
expression optional, to indicate failure to infer the type if the
column_specification was not given.
This patch only arranges for the new calling convention (as a new
try_prepare_expression call), it does not actually implement anything.
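The two directions of type flow can be sketched with a toy model. Everything
below (the classes, `try_prepare()`, `prepare_binop()`) is a hypothetical
illustration of the calling convention, not the cql3::expr code: the receiver
is optional, and `None` is returned when no type can be inferred without it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type: str


@dataclass(frozen=True)
class BindVariable:
    name: str


@dataclass(frozen=True)
class TupleConstructor:
    elements: tuple


@dataclass(frozen=True)
class Prepared:
    expr: object
    type: object  # a type name, or a tuple of type names for tuples


def try_prepare(expr, receiver=None):
    """Prepare expr; return None if its type cannot be inferred without a receiver."""
    if isinstance(expr, Column):
        if receiver is not None and receiver != expr.type:
            raise TypeError(f"{expr.name} is {expr.type}, expected {receiver}")
        return Prepared(expr, expr.type)  # type derived from the column itself
    if isinstance(expr, BindVariable):
        if receiver is None:
            return None                   # nothing to infer the type from
        return Prepared(expr, receiver)   # type imposed by the caller
    if isinstance(expr, TupleConstructor):
        if receiver is None:
            parts = [try_prepare(element) for element in expr.elements]
            if any(part is None for part in parts):
                return None
        else:
            if len(receiver) != len(expr.elements):
                raise TypeError("tuple size mismatch")
            parts = [try_prepare(e, r) for e, r in zip(expr.elements, receiver)]
        return Prepared(expr, tuple(part.type for part in parts))
    raise TypeError("unsupported expression")


def prepare_binop(lhs, rhs):
    """Derive the left-hand-side type, then impose it on the right-hand side."""
    left = try_prepare(lhs)
    if left is None:
        raise TypeError("cannot infer the type of the left-hand side")
    return left, try_prepare(rhs, left.type)


# WHERE (a, b) = (:a, :b)
left, right = prepare_binop(
    TupleConstructor((Column("a", "int"), Column("b", "text"))),
    TupleConstructor((BindVariable("a"), BindVariable("b"))),
)
print(left.type, right.type, try_prepare(BindVariable("x")))
```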
For most types, we just return the type field. A few expressions have
other methods to access the type, and some expressions cannot survive
prepare and so calling type_of() on them is illegal.
Currently, preparing a cast drops the cast completely (as the
types are verified to be binary compatible). This means we lose
the casted-to type. Since we wish to keep type information, keep the
cast in the prepared expression tree (and therefore the casted-to
type).
Once we do that, we must extend evaluate() to support cast
expressions.
Almost all expressions either already have a type field or
have an O(1) way of reaching the type (for example, column_value
can access the type via its column_definition).
Add a type field to the few expression types that don't already
have it. Since prepare_expr() doesn't yet generate these expressions,
we don't have any place to populate it, so it remains null.
A cast expression naturally includes a data type indicating what type
we are casting into. Right now the prepared form uses cql3_type.
Change it to data_type which is what other expressions use to reduce
friction. Since cql3_type is a thin wrapper around data_type, the
change is minimal.
The change propagates to selectable::with_cast, but again it is
minimal.
At this point, none of the remaining uses of
`flat_mutation_reader` (all of which are results of calling
`downgrade_to_v1()` anyway) actually need a full-featured flat
mutation reader with its own separate buffer etc.
`mutation_fragment_v1_stream` can only be constructed by wrapping a
`flat_mutation_reader_v2`, contains enough functionality for the
remaining consumers of `mutation_fragment_v1` sources and unit tests
and no more, and does not buffer.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The projected limited replacement of downgraded v1 mutation reader
will not do its own buffering, so this test will be pointless.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
mutation_source are going to be created only from v2 readers and the
::make_reader() method family is scheduled for removal.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Let's also use generation_type for compaction ancestors, so once we
support something other than integer for SSTable generation, we
won't have discrepancy about what the generation type is.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction ancestors are only available in versions older than mx,
therefore we can make it optional in seal_statistics(). The motivation
is that mx writer will no longer call sstable::compaction_ancestors()
whose return type will soon be changed to generation_type, so the
returned value can be something other than an integer, e.g. uuid.
We could kill compaction_ancestors in seal_statistics interface, but
given that most generic write functions still work for older versions,
if there were still a writer for them, I decided to not do it now.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes#10489
Killing the CDC log table on CDC disable is unhelpful in many ways,
partly because it can cause random exceptions on nodes trying to
do a CDC-enabled write at the same time as log table is dropped,
but also because it makes it impossible to collect data generated
before CDC was turned off, but which is not yet consumed.
Since data should be TTL:ed anyway, retaining the table should not
really add any overhead beyond the compaction to eventually clear
it. And if the user did set TTL=0 (disabled), then they are already
responsible for clearing out the data.
This also has the nice feature of meshing with the alternator streams
semantics.
Closes#10601
The change
- adds a test which exposes a problem of a peculiar setup of
tombstones that trigger a mutation fragment stream validation exception
- fixes the problem
Applying tombstones in the order:
range_tombstone_change pos(ck1), after_all_prefixed, tombstone_timestamp=1
range_tombstone_change pos(ck2), before_all_prefixed, tombstone=NONE
range_tombstone_change pos(NONE), after_all_prefixed, tombstone=NONE
Leads to swapping the order of mutations when written and read from
disk via sstable writer. This is caused by conversion of
range_tombstone_change (in memory representation) to range tombstone
marker (on disk representation) and back.
When this mutation stream is written to disk, the range tombstone
markers type is calculated based on the relationship between
range_tombstone_changes. The RTC series as above produces markers
(start, end, start). When the last marker is loaded from disk, its kind
gets incorrectly loaded as before_all_prefixed instead of
after_all_prefixed. This leads to incorrect order of mutations.
The solution is to skip writing a new range_tombstone_change with empty
tombstone if the last range_tombstone_change already has empty
tombstone. This is redundant information and can be safely removed,
while the logic of encoding RTCs as markers doesn't handle such
redundancy well.
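The skip rule described above can be sketched in a few lines. This is only an illustration, not the sstable writer's code: `RangeTombstoneChange` here is a simplified, hypothetical stand-in (a string position and an optional timestamp, with `None` modelling an empty tombstone), and `drop_redundant_empty_changes` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class RangeTombstoneChange:
    position: str              # simplified stand-in for a clustering position
    timestamp: Optional[int]   # None models an empty tombstone


def drop_redundant_empty_changes(
    changes: Iterable[RangeTombstoneChange],
) -> Iterator[RangeTombstoneChange]:
    """Yield changes, skipping an empty one that directly follows an empty one."""
    last: Optional[RangeTombstoneChange] = None
    for rtc in changes:
        if last is not None and last.timestamp is None and rtc.timestamp is None:
            # The previous change already closed the tombstone; this one
            # carries no new information, so it is not written.
            continue
        last = rtc
        yield rtc
```

Fed the three changes from the example in this message, the sketch keeps the first two and drops the trailing empty one.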
Closes#10643
I noticed that `column_condition` (used in LWT `IF` clause) supports lists.
As part of the Grand Expression Unification we'll need to migrate that to
expressions, so we'll need to support list subscripts.
Use the opportunity to relax the normal filtering to allow filtering on
list subscripts: `WHERE my_list[:index] = :value`.
Closes#10645
* github.com:scylladb/scylla:
test: cql-pytest: add test for list subscript filtering
doc: document list subscripts usable in WHERE clause
cql3: expr: drop restrictions on list subscripts
cql3: expr: prepare_expr: support subscripted lists
cql3: expressions: reindent get_value()
cql3: expression: evaluate() support subscripting lists
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurrences (i.e. caller is a coroutine).
One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.
Closes#10699
This two-patch series makes two improvements to configure.py:
The first patch fixes, yet again, issue #4706 where interrupting ninja's rebuild of build.ninja can leave it without any build.ninja at all. The patch uses a different approach from the previous pull-request #10671 that aimed to solve the same problem.
The second patch makes the output of configure.py more reproducible, not resulting in a different random order every time. This is useful especially when debugging configure.py and wanting to check if anything changed in its output.
Closes#10696
* github.com:scylladb/scylla:
configure.py: make build.ninja the same every time
configure.py: don't delete build.ninja when rebuild is interrupted
* seastar 96bb3a1b8...2be9677d6 (37):
> Merge 'stream_range_as_array: always close output stream' from Benny Halevy
Fixes#10592
> net/api: add "server_socket::is_listening()"
> src/net/proxy: remove unused variable
> coroutine: parallel_for_each: relax contraints
> native-stack: do not use 0 as ip address if !_dhcp
> coroutine: fix a typo in comment
> std-coroutine: include for LLVM-14
> tutorial: use non-variadic version of when_all_succeed()
> scripts: Fix build.sh to use new --c++-standard config option
> core/thread: initialize work::pr and work::th explicitly
> util/log-impl: remove "const" qualifier in return type
> map_reduce: remove redundant move() in return statement
> util: mark unused parameter with [[maybe_unused]]
> drop unused parameters
> build: use "20" for the default CMAKE_CXX_STANDARD
> build: make CMAKE_CXX_STANDARD a string
> utils: log: don't crash on allocation failure while extending log buffer
> tests: unix_domain_test: fix thread/future confusion in client_round()
> compat: do not use std::source_location if it is broken
> build: use CMAKE_CXX_STANDARD instead of Seastar_CXX_DIALECT
> Merge 'Add hello-world demo from tutorial' from Pavel
> rpc_tester: Put client/server sides into correct sched groups
> reactor_backend: Use _r reference, not engine() method
> future.hh: #include std-compat.hh for SEASTAR_COROUTINES_ENABLED
> Merge "Add more CPU-hog facilities to RPC-tester" from Pavel E
> Merge "io: Enlighten queued_request" from Pavel E
> Correct swapped AIO detection/setup calls
> sharded: De-duplicate map-reduce overloads
> file: don't trample on xfs flags when setting xfs size hint
> Merge "Per-class IO bandwidth limits" from Pavel E
> Merge 'sstring: fix format and optimize the performance of sstring::find().' from Jianyong Chen
> reactor_backend: Mark reactor_backend_aio::poll() private
> scripts/build.sh: Mind if not running on a terminal
> test, rpc: Don't work with large buffers
> test, futures: Don't expect ready future to resolve immediately
> source_location compatibility: Fix an unused private field error when treat warning as errors
> file: Remove try-catch around noexcept calls
Currently it works, but the newer version of seastar's map_reduce()
is compiled in a way to trigger use-after-free on accessing captured
value.
tests: unit(dev), unit.alternator(debug on v1)
Fixes#10689
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220523095409.6078-1-xemul@scylladb.com>
When logging a failed checksum on a compressed chunk.
Currently, only the offset is logged, but the index of the chunk whose
checksum failed to validate is also interesting.
Closes#10693
In several places, configure.py uses unsorted sets which results in
its output being in different order every time - both a different
order of targets, and a different order in dependencies of each
target.
This is both strange, and annoying when trying to debug configure.py
and trying to understand when, if at all, its output changes.
So in this patch, we use "sorted(...)" in the right places that
are needed to guarantee a fixed order. This fixed order is alphabetical,
but that's not the goal of this patch - the goal is to ensure a fixed
order.
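A minimal sketch of the idea, assuming a hypothetical generator that emits one line per target (the `link` rule name and the helper names are invented for illustration and are not taken from configure.py):

```python
def emit_rule(target, deps):
    # Sorting the dependency set fixes the order within one rule.
    return f"build {target}: link {' '.join(sorted(deps))}"


def emit_build_file(targets):
    """targets maps a target name to a set of dependency names."""
    # Sorting the targets fixes the order of the rules themselves.
    return "\n".join(emit_rule(t, targets[t]) for t in sorted(targets))
```

Without the two `sorted(...)` calls, iteration order over the sets could differ from one run to the next, which is the kind of spurious difference the patch removes.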
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In commit 9cc9facbea, I fixed issue #4706.
That issue is about what happens when interrupting a rebuild of build.ninja
(which happens automatically when you run "ninja" after configure.py
changed). We don't want to leave behind a half-built build.ninja,
or leave it deleted.
The solution in that commit was for configure.py to build a temporary file
(build.ninja.tmp), and only as the very last step rename it build.ninja.
Unfortunately, since that time, we added more last steps after what
used to be that very last step :-(
If this new code running after the rename takes a noticeable amount of
time, and if the user is unlucky enough to interrupt it during that
time, ninja will see a modified output file (build.ninja) and a failed
rule, and will delete the output file!
The solution is to move the rename out of configure.py. Instead, we
add a "--out=filename" option to configure.py which allows it to write
directly to a different file name, not build.ninja. When rebuilding
build.ninja, the rule will now call configure.py with "--out=build.ninja.new"
and then rename it back to build.ninja. Any failure or interrupt at any
stage of configure.py will leave build.ninja untouched, so ninja will
not delete it - it will just delete the temporary build.ninja.new.
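The write-then-rename pattern can be sketched as follows. In the patch the two steps are split between configure.py (which writes to the `--out` file) and the ninja rule (which renames it); the sketch folds both into one hypothetical `regenerate` function purely to show why the final file is never left half-written.

```python
import os


def regenerate(out_path, generate):
    """Write new content to out_path + '.new', then rename it over out_path."""
    tmp_path = out_path + ".new"
    try:
        with open(tmp_path, "w") as f:
            generate(f)                 # may fail or be interrupted at any point
        os.replace(tmp_path, out_path)  # the only step that touches out_path
    except BaseException:
        # On failure only the temporary file is discarded.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If `generate` raises, `out_path` still holds its previous content.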
Fixes#4706 (again)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.
The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.
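A rough sketch of the three-way split, not the test's actual code. How the real nemesis picks the sizes of the parts is not stated here; the sketch assumes two random cut points and assumes at least one voter is always kept. `split_for_reconfiguration` is a hypothetical name.

```python
import random


def split_for_reconfiguration(servers, rng):
    """Shuffle servers and split them into voters, non-voters and non-members.

    Returns (config, non_members), where config is a list of
    (server_id, can_vote) pairs. Requires at least one server.
    """
    shuffled = list(servers)
    rng.shuffle(shuffled)
    n = len(shuffled)
    first_cut = rng.randint(1, n)            # assumption: keep >= 1 voter
    second_cut = rng.randint(first_cut, n)
    voters = shuffled[:first_cut]
    non_voters = shuffled[first_cut:second_cut]
    non_members = shuffled[second_cut:]
    config = [(s, True) for s in voters] + [(s, False) for s in non_voters]
    return config, non_members
```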
The PR also includes fixes for some liveness problems stumbled upon
during testing.
Closes#10640
* github.com:scylladb/scylla:
test: raft: randomized_nemesis_test: include non-voters during reconfigurations
raft: server: if `add_entry` with `wait_type::applied` successfully returns, ensure `state_machine::apply` is called for this entry
raft: tracker: fix the definition of `voters()`
raft: when printing `raft::server_address`, include `can_vote`
The message only dumps the last sealed fragment, but it should dump
all the output fragments replacing the exhausted input ones.
Let's print the origin of output fragments, so we can differentiate between
files with compaction and garbage-collection origin.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220524232232.119520-1-raphaelsc@scylladb.com>
"
Azure snitch tries to replicate dc/rack info from all shards to all
other shards. This may lead to use-after-free when shard A gets "this"
from shard B, starts copying its _dc field and the shard A destructs
its _dc from under B because it's receiving one from shard C.
Also polish replication code a little bit while at it.
"
* 'br-azure-snitch-serialize' of https://github.com/xemul/scylla:
snitch: Use invoke_on_others() to replicate
snitch: Merge set_my_dc and set_my_rack into one
azure_snitch: Do nothing on non-io-cpu
Restriction validation forbids lists (somewhat oddly, it talks about
indexes; validation should make a soft check about indexes (since it
can fall back to filtering) and a hard check about supported filtering
expressions), and enforces a map in another place. Remove the first
restriction and relax the second to allow lists as well as maps as
subscript operands.
Some validation messages are adjusted to reflect that lists are supported.
Infer the type of a list index as int32_type.
The error message when a non-subscriptable type is provided is
changed, so the corresponding test is changed too.
We already support subscripting maps (for filtering WHERE m[3] = 6),
so adding list subscript support is easy. Most of the code is shared.
Differences are:
- internal list representation is a vector of values, not of key/values
- key type is int32_type, not defined by map
- need to check index bounds
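The shared map/list path and the extra bounds check can be sketched like this. It is an illustration only: Python `dict` and `list` stand in for the internal representations, and returning `None` (null) for an out-of-range index is an assumption, since this message only says that bounds need to be checked.

```python
def subscript(container, key):
    """Evaluate container[key] for a map (dict) or a list."""
    if container is None or key is None:
        return None
    if isinstance(container, dict):
        # Map: the key type is defined by the map itself.
        return container.get(key)
    if isinstance(container, list):
        # List: the key is an integer index.
        if isinstance(key, bool) or not isinstance(key, int):
            raise TypeError("list index must be an integer")
        if key < 0 or key >= len(container):
            return None  # assumption: out-of-range index yields null
        return container[key]
    raise TypeError("value is not subscriptable")
```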
is not supported' from Nadav Har'El
This small series implements the DescribeTimeToLive and
DescribeContinuousBackups operations in Alternator. Even if the
corresponding features aren't implemented, it can help applications that
we implement just the Describe operation that can say that this feature
is in fact currently disabled.
Fixes #10660
Closes #10670
* github.com:scylladb/scylla:
alternator: remove dead code
alternator: implement DescribeContinuousBackups operation
alternator: allow DescribeTimeToLive even without TTL enabled
This patch contains five tests which reproduce three old bugs in
Scylla's handling of multi-column restrictions like (c1,c2)<(1,2).
These old bugs are:
Refs #64 (yes, a two-digit issue!)
Refs #4244
Refs #6200
The three github issues are closely intertwined, exposing the same
or similar bugs in our internal implementation, and I suspect that
eventually most of them could be fixed together.
In writing these tests, I carefully read all three issues and the
various failure scenarios described in them, tried to distill and
simplify the scenarios, and also consider various other broken
variants of the scenarios. The resulting tests are heavily commented,
explaining the motivation of each test and exactly which of the
above bugs it reproduces.
All five tests included in this patch pass on Cassandra and currently
fail on Scylla, so are marked "xfail".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10675
It's an `int64_t` that needs to be explicitly initialized, otherwise the
value is undefined.
This is probably the cause of #10639, although I'm not sure - I couldn't
reproduce it (the bug is dependent on how the binary is compiled, so
that's probably it). We'll see if it reproduces with this fix, and if
it will, close the issue.
Closes#10681
"
The table keeps references on sstables_ and compaction_ managers
(among other things), but the latter sits as a pointer on table's
config while the former -- as on-table direct reference.
This set unifies both by turning sstables manager on-config pointer
into on-table reference.
branch: https://github.com/xemul/scylla/tree/br-table-vs-sstables-manager
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/574/
"
* 'br-table-vs-sstables-manager' of https://github.com/xemul/scylla:
tests: Remove sstables_manager& from column_family_test_config()
table: Move sstables_manager from config onto table itself
table, db, tests: Pass sstables_manager& into table constructor
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.
Fixes#10668.
Closes#10678
In issues #7944 and #10625 it was noticed that by assigning an empty
string to a non-string type (int, date, etc.) using INSERT or
INSERT JSON, some combinations of the above can create "empty" values
while they should produce a clear error.
The tests added in this patch explore the different combinations of
types and insert modes, and reproduce several buggy cases in Scylla
(resulting in xfail'ing tests) as well as Cassandra.
We feared that there might be a way using those buggy statements to
create a partition with an empty key - something which used to kill
older versions of Scylla. But the tests show that this is not possible -
while a user can use the buggy statements to create an empty value,
Scylla refuses it when it is used as a single-column partition key.
Refs #10625
Refs #7944
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10628
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.
In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:
scheduling_group
messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
if (must_compress) {
opts.compressor_factory = &compressor_factory;
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
- opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ // We send cookies only for non-default statement tenant clients.
+ if (idx > 3) {
+ opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ }
This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was change to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.
Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.
Fixes#9505.
Closes#10673
The purpose of this series is to introduce infrastructure
for managed scylla processes into test.py,
switch some existing suites to use test.py managed processes
and introduce cluster tests.
All of this is expected to make it possible to test Raft topology
changes and schema changes using an easy to use and fast tool
such as test.py. In general this will allow testing Scylla clusters
from within the development test harness.
Branch URL: kostja/test.py.v5
Closes#10406
* github.com:scylladb/scylla:
test: disable topology/test_null
test.py: disable cdc_with_lwt_test it's flaky in debug mode
test.py: workaround for a python bug
test: cleanup (drop keyspace) in two cql tests to support --repeat
test.py: respect --verbose even if output is a tty
test: remove tools/cql_repl
test.py: switch cql/ suite to pytest/tabular output
test: remove a flaky test case
test.py: implement CQL approval tests over pytest
test.py: implement cql_repl
test.py: add topology suite
test.py: add common utility functions to test/pylib
test.py: switch cql-pytest and rest_api suites to PythonTestSuite
test.py: introduce PythonTest and PythonTestSuite
test.py: use artifact registry
test.py: temporarily disable raft
test.py: (pylib) add Scylla Server and Artifact Registry
test.py: (pylib) add Host Registry to track used server hosts
test.py: (pylib) add a pool of scylla servers (or clusters)
The manager reference is already available in constructor and thus
can be copied to on-table member.
The code that chooses the manager (user/system one) should be moved
from make_column_family_config() into add_column_family() method.
Once this happens, the get_sstables_manager() should be fixed to
return the reference from its new location. While at it -- mark the
method in question noexcept and add its mutable overload.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In core code there's only one place that constructs table -- in
database.cc -- and this place currently has the sstables_manager pointer
sitting on table config (although it's a pointer, it's always non-null).
All the tests always use the manager from one of _env's out there.
For now the new constructor arg is unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.
The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.
Previously it could happen that `add_entry` returned successfully but
`state_machine::apply` was never called by the server for this entry,
even though `wait_type::applied` was used, if the server loaded
a snapshot that contained this entry in just the right moment. Some
clients may find this behavior surprising, even though we may argue that
it's not technically incorrect.
For example, the nemesis test assumed that if `add_entry` returned
successfully (with `wait_type::applied`), the local state machine
applied the entry; the test uses `apply` to obtain an output - the
result of the command - from the state machine.
It's not a problem to give a stronger guarantee, so we do it in this
commit. In the scenario where a snapshot causes Raft to skip over the
entry, `add_entry` will finish exceptionally with
`commit_status_unknown`.
The previous implementation was weird, and it's not even clear if
the C++ standard defined what the result would be (because it used
`std::unordered_set::insert(iterator, iterator)`, where the iterators
pointed to a sequence of elements with elements that already had
equivalent elements in the set; cppreference does not specify which
elements end up in the set in this case).
In any case, in testing, the resulting set did not give the desired
result: if the configuration was joint, and a server was a voter in
the previous config but a non-voter in the current one, it would
not be a member of this set. This would cause the server to not vote for
itself when it became a candidate, which could lead to cluster
unavailability.
The new definition is simple: a server belongs to `voters()` iff it is
a voter in current or previous configuration. This fixes the problem
described above.
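The new definition is small enough to write out. The sketch below is a Python model, not the tracker's code: each configuration is represented as a hypothetical dict from server id to its `can_vote` flag, and `previous` is `None` when the configuration is not joint.

```python
def voters(current, previous=None):
    """Servers that are voters in the current or the previous configuration."""
    result = {sid for sid, can_vote in current.items() if can_vote}
    if previous is not None:  # joint configuration
        result |= {sid for sid, can_vote in previous.items() if can_vote}
    return result
```

With `previous = {1: True, 2: True, 3: True}` and `current = {1: False, 2: True, 3: True}`, server 1 is still included, which is the case the old implementation got wrong.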
Fixes#10618.
This patch adds reproducing tests for wrong handling of LIMIT in a query
which uses a secondary index *and* filtering, described in issue #10649.
In that case, Scylla incorrectly limits the number of rows found in the
index *before* the filtering, while it should limit the number of rows
*after* the filtering.
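The difference between the two orders is easy to see on a toy example. The two functions below are illustrative only (they do not model Scylla's paging or index code); rows are plain dicts and the predicate plays the role of the filtering restriction.

```python
from itertools import islice


def limit_before_filter(index_rows, predicate, limit):
    # Buggy order: cut the index result first, then filter what is left.
    return [r for r in index_rows[:limit] if predicate(r)]


def limit_after_filter(index_rows, predicate, limit):
    # Expected order: filter first, then keep at most `limit` matches.
    return list(islice((r for r in index_rows if predicate(r)), limit))


rows = [{"k": 1, "v": 0}, {"k": 2, "v": 1}, {"k": 3, "v": 1}]
matches_v1 = lambda r: r["v"] == 1
```

With `limit=2`, the first function returns only the row with `k=2`, while the second returns both matching rows.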
The tests in this patch (which xfail on Scylla, and pass on Cassandra)
go beyond the minimum required to reproduce this bug. It turns out that
there are different sub-cases of this problem that go through different
code paths, namely whether the base table has clustering keys or just
partition keys, and whether the overall LIMITed result spans more than
one page. So these tests attempt to also cover all these sub-cases.
Without all these test sub-cases, an incomplete and incorrect fix of this
bug may, by chance, cause the original test to succeed.
Refs #10649
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10658
Introduce a new operation, `raft_read`, which calls `read_barrier`
on a server, reads the state of the server's state machine, and returns
that state.
Extend the generator in `basic_generator_test` to generate `raft_read`s.
Only do it if forwarding is enabled (although it may make sense to test
read barriers in non-forwarding scenario as well - we may think about it
and do it in a follow-up).
Check the consistency of the read results by comparing them with the model
and using the result to extend the model with any newly observed elements.
The patchset includes some fixes for correctness (#10578)
and liveness (handling aborts correctly).
Closes#10561
* github.com:scylladb/scylla:
test: raft: randomized_nemesis_test: check consistency of reads
test: raft: randomized_nemesis_test: perform linearizable reads using read_barriers
test: raft: randomized_nemesis_test: add flags for disabling nemeses
raft: server: in `abort()`, abort read barriers before waiting for rpc abort
raft: server: handle aborts correctly in `read_barrier`
raft: fsm: don't advance commit index further than match_idx during read_quorum
Remove the function make_keyspace_name() that was never used.
We *could* have used this function, but we didn't, and it had
an inconvenient API. If we later want to de-duplicate the
several copies of "executor::KEYSPACE_NAME_PREFIX + table_name"
we have in the code, we can do it with a better API.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although we don't yet support the DynamoDB API's backup features (see
issue #5063), we can already implement the DescribeContinuousBackups
operation. It should just say that continuous backups, and point-in-time
restores, are disabled.
This will be useful for client code which tries to inquire about
continuous backups, even if not planning to use them in practice
(e.g., see issue #10660).
Refs #5063
Refs #10660
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Scylla has a bug that only fires once in a hundred runs in debug mode
when a schema change parallel to a topology change leads to a lost
keyspace and internal error. Disable the tests until Raft is enabled for
schema.
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!
This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.
The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.
After this patch, the following test now works on Scylla without
experimental flags turned on:
test/alternator/run test_ttl.py::test_describe_ttl_without_ttl
Refs #10660
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If the output is not a tty, verbose is set automatically.
If the output is a tty, one has to request --verbose.
However, a part of test.py verbosity was ignoring --verbose
and looking only at the terminal type.
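The intended rule can be written as a single predicate. `effective_verbose` is a hypothetical helper name, not a function in test.py; the sketch only shows that the `--verbose` flag and the terminal check have to be combined rather than consulted separately.

```python
import sys


def effective_verbose(requested, stream=sys.stdout):
    """Verbose if the user asked for it, or if the output is not a terminal."""
    return bool(requested) or not stream.isatty()
```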
When we append an entry to a list with the same user-defined
timestamp, the behaviour is actually undefined. If the append
is processed by the same coordinator as the one that accepted
the existing entry, then it gets the same timeuuid as the list key,
and replaces (potentially) the existing list value. Otherwise it
gets a timeuuid which may be either larger or smaller than the existing
key's timeuuid, and then turns into either an append or a prepend.
The part of the timestamp responsible for the result is the shard
id's spoof node address implemented in scope of fixing Scylla's
timeuuid uniqueness. When the test was implemented all spoof node ids
were 0 on all shards and all coordinators. Later the difference
in behaviour was dormant because cql_repl would always execute
the append on the same shard.
We could fix Scylla to use a zero spoof node address in case a user
timestamp is supplied, but the purpose of this is unclear, it
may actually be to the contrary of the user's intent.
Before this patch, approval tests (test/cql/*) were using a C++
application called cql_repl, which is a seastar app running
Scylla, reading commands from the standard input and producing
results in Json format in the standard output. The rationale for
this was to avoid running a standalone Scylla which could leak
more resources such as open sockets.
Now that other suites already start and stop Scylla servers, it
makes more sense to run CQL commands in approval tests against an
existing running server. It saves us from building one more
binary and allows us to better format the output. Specifically, we
would like to see Scylla output in tabular format in approval
tests, which is difficult to do when C++ formatting libraries
are used.
Implement a pytest which would run CQL commands against
a scylla server and pretty print server output.
Will be used in existing Approval tests in subsequent patches.
Manage scylla servers for rest_api and cql-pytest suites
using PythonTestSuite. The pool size determines the max
number of servers test.py would run concurrently per
suite. For tiny suites (rest_api) the cost of starting
the servers outweighs the cost of running tests so keep
it at a minimum. cql-pytest has dozens of tests, so run them
in 4 parallel tracks.
Track running tests in the suite.
Cleanup after each suite (after all tests
in the suite end).
Cleanup all artifacts before exit. Don't drop server logs if
there is at least one failed test.
Allow starting clusters of Scylla servers. Chain up the next
server start to the end of the previous one, and set the next
server's seed to the previous server.
As a workaround for a race between token dissemination through
gossip and streaming, change schema version to force a gossip
round and make sure all tokens end up at the joining node in time.
Make sure scylla start is not race prone.
auth::standard_role_manager creates "cassandra" role in an async loop
auth::do_after_system_ready(), which retries role creation with an
exponential back-off. In other words, even after CQL port is up, Scylla
may still be initializing.
This race condition could lead to spurious errors during cluster
bootstrap or during a test under CI.
When the role is ready, queries begin to work, so rely on this "side
effect".
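One way to rely on that side effect is to retry a query until it succeeds. The sketch below is an assumption about how such a wait could look, not the code in test/pylib: `run_query` is a hypothetical callable that raises while the server is still initializing, and the back-off parameters are invented.

```python
import time


def wait_until_ready(run_query, timeout=60.0, initial_delay=0.1,
                     max_delay=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call run_query() until it stops raising, backing off between attempts."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            run_query()
            return
        except Exception:
            if clock() >= deadline:
                raise                      # give up and surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.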
To start or stop servers, use a new class, ScyllaCluster,
which encapsulates multiple servers united into a cluster.
In it, validate that a test case cleans up after itself.
Additionally, swallow startup errors and throw them when
the test is actually used.
The test would perform `read_barrier`s but not check the correctness
of the reads: whether the state observed by a read is consistent with
the model and recent enough (in short, check linearizability).
This commit adds the correctness checks.
Introduce a new operation, `raft_read`, which calls `read_barrier`
on a server, reads the state of the server's state machine, and returns
that state.
Extend the generator in `basic_generator_test` to generate `raft_read`s.
Only do it if forwarding is enabled (although it may make sense to test
read barriers in non-forwarding scenario as well - we may think about it
and do it in a follow-up).
For now, we don't check the consistency of the results of the reads.
They do return the observed state, but we don't compare it yet with the
model. For now we simply issue the reads concurrently with other
operations to introduce some more chaos to the cluster and check
liveness and consistency of existing operations.
`rpc::abort` may need to wait until all read barriers finish, so abort
read barrier before waiting for `rpc::abort` to finish to avoid a
deadlock on shutdown.
`rpc::abort` is still called before the read barriers are aborted, only
waited for after. Calling it first prevents new read barriers from being
started by `rpc` (see `rpc::abort` comment).
Also prevent new read barriers from being started after abort starts
directly on a leader by checking the `_aborted` flag at the beginning
of `execute_read_barrier`.
Finally, use the opportunity to remove some compiler-dependent code.
Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB).
When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.
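The halving step can be sketched as follows. The block representation (a tuple of first key, last key and size) and the function name are hypothetical, and how the real writer treats an odd trailing block is not stated here; the sketch simply leaves it unmerged.

```python
def halve_promoted_index(blocks, desired_block_size):
    """Merge each two adjacent blocks and double the desired block size.

    blocks is a list of (first_key, last_key, size) tuples.
    """
    merged = []
    for i in range(0, len(blocks), 2):
        pair = blocks[i:i + 2]
        merged.append((pair[0][0], pair[-1][1], sum(b[2] for b in pair)))
    return merged, desired_block_size * 2
```

Each pass roughly halves the number of index entries, so the serialized index drops back under the threshold while later blocks are built at the doubled size.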
Fixes#4217
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10646
* github.com:scylladb/scylla:
sstables: mx: add pi_auto_scale_events metric
sstables: mx/writer: auto-scale promoted index
When entry loading fails and there is another request blocked on the
same page, attempt to erase the failed entry will abort because that
would violate entry_ptr guarantees, which is supposed to keep the
entry alive.
The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch makes a more
general approach by relaxing the assert.
The assert manifested like this:
scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.
Fixes #10617
Closes #10653
The docs tests dislike the gdbinit link because it refers out of
the source tree. Unconfuse the tests by removing the link. It's
sad, but the file is more easily used by referring to it rather
than viewing it, so give a hint about that too.
Closes #10650
msg_proc_guard is a guard that makes sure _msg_processing is always
decreased. We can use regular defer() to achieve the same.
Message-Id: <YoZTQPbTMWAdCObs@scylladb.com>
Extend the reconfiguration nemesis to send `modify_config` requests as
well as `reconfigure` requests. It chooses one or the other with
probability 1/2.
Fix a bunch of problems that surfaced during testing.
Closes #10544
* github.com:scylladb/scylla:
test: raft: randomized_nemesis_test: send `modify_config` requests in reconfiguration nemesis
test: raft: randomized_nemesis_test: fix `rpc` reply ID generation
test: raft: randomized_nemesis_test: during bouncing call, allow a leader to reroute to itself
test: raft: randomized_nemesis_test: handle timed_out_error from modify_config
service: raft: rpc: don't call `execute...` functions after `abort()`
raft: server: fix bad_variant_access in `modify_config`
Counts the number of promoted index auto-scale events.
A large number of those, relative to `partition_writes`,
indicates that `column_index_size_in_kb` should be increased.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add column_index_auto_scale_threshold_in_kb to the configuration
(defaults to 10MB).
When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.
Fixes #4217
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extend the reconfiguration nemesis to send `modify_config` requests as
well as `reconfigure` requests. It chooses one or the other with
probability 1/2.
When `rpc` wants to perform a two-way RPC call it sends a message
containing a `reply_id`. The other side will send the `reply_id` back
when answering, so the original side can match the response to the promise
corresponding to the future being waited on by the RPC caller.
Previously each instance of `rpc` generated reply IDs independently as
increasing integers starting from 0. The network delivers messages
based on Raft server IDs. A response message may thus be delivered not
to the original instance which invoked the RPC, but to a new instance
which uses the same Raft server ID (after we simulated a server
crash/stop and restart, creating a new server with the same ID that
reuses the previous instance's `persistence` instance but has a new `rpc`).
The new instance could have started a new RPC call using the same
`reply_id` as one currently being in-flight that was started by the
previous instance. The new instance could then receive and handle a
response that was intended for the previous instance, leading to weird
bugs.
Fix this by replacing the local reply ID counters by a global counter so
that every two-way RPC call gets a unique reply ID.
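The reply-ID collision and its fix can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Rpc` class, its `pending` dictionary and the method names are hypothetical stand-ins for the test's C++ `rpc` class, and only the idea of a single shared counter comes from the description above.

```python
import itertools

# One counter shared by every rpc instance in the test process, so a
# restarted server can never reuse a reply ID that is still in flight.
_next_reply_id = itertools.count()


class Rpc:
    """Hypothetical stand-in for the test's `rpc` class."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.pending = {}  # reply_id -> result slot (stands in for a promise)

    def start_call(self):
        reply_id = next(_next_reply_id)
        self.pending[reply_id] = None
        return reply_id

    def handle_reply(self, reply_id, payload):
        # A reply addressed to a previous instance carries an ID this
        # instance never issued, so it is dropped instead of being
        # matched to the wrong call.
        if reply_id not in self.pending:
            return False
        self.pending[reply_id] = payload
        return True


old = Rpc(server_id=1)
stale_id = old.start_call()   # call still in flight when the server "crashes"
new = Rpc(server_id=1)        # restarted server: same Raft ID, new rpc
fresh_id = new.start_call()
```

With per-instance counters both calls would have received ID 0, and the late reply to the old instance would have been accepted by the new one.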
A server executing a `modify_config` call, even if it initially was a
leader and accepted the request, may end up throwing a `not_a_leader`
error, rerouting the caller to a new leader - but this new leader may be
that same server. This happens because `execute_modify_config`
translates certain errors that it considers transient (such as
`conf_change_in_progress`) into `not_a_leader{last_known_leader}`,
in an attempt to notify the caller that they should retry the request; but
when this translation happens, the `last_known_leader` may be that same
server (it could have even lost leadership and then regained it
while the request was being handled).
This is not strictly an error, and it should be safe for the client to
retry the request by sending it to the same server. The nemesis test
assumed that a server never returns `not_a_leader{itself}`; this commit
drops the assumption.
An alternative solution would be to extend the error types that are now
translated to `not_a_leader` so they include information about the last
known leader. This way the client does not lose information about the
original error and still gets a potential contact point for retry.
The functions are called from RPC when a follower forwards a request to
a leader (`add_entry`, `modify_config`, `read_barrier`). The call may be
attempted during shutdown. The Raft shutdown code cleans up data structures
created by those requests. Make sure that they are not updated
concurrently with shutdown. This can lead to problems such as using the
server object after it was aborted, or even destroyed.
After this change, the RPC implementation may wait for an `execute_modify_config`
call to finish before finishing abort. That call in turn may be stuck on
`wait_for_entry`. Thus the waiter may prevent RPC from aborting. Fix
this by moving the wait on the future returned from `_rpc->abort()` in
`server::abort()` until after waiters were destroyed.
`modify_config` would call `execute_modify_config` or
`_rpc->send_modify_config`, which returned a reply of type
`add_entry_reply`. This is a variant of 3 options: `entry_id`,
`not_a_leader`, or `commit_status_unknown`. The code would check
for the `entry_id` option and otherwise assume that it was `not_a_leader`.
During nemesis testing however, the reply was sometimes
`commit_status_unknown`, which caused a `bad_variant_access` exception
during `std::get` call. Fix this.
There is a similar piece of code in `add_entry`, but there it should be
impossible to obtain `commit_status_unknown` even though the types don't
enforce it. Make it more explicit with a comment and an assertion.
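The failure mode above, a three-way variant handled as if it had only two alternatives, can be sketched as follows. This is a Python analogy under stated assumptions: the dataclasses mirror the three reply alternatives named in the message, while `handle_modify_config_reply` and the strings it returns are hypothetical. What the real code does on `commit_status_unknown` is not stated in the message; the sketch only shows the third alternative being handled explicitly.

```python
from dataclasses import dataclass


@dataclass
class EntryId:
    term: int
    idx: int


@dataclass
class NotALeader:
    leader: object


@dataclass
class CommitStatusUnknown:
    pass


def handle_modify_config_reply(reply):
    """Return a (hypothetical) description of what the caller does next."""
    if isinstance(reply, EntryId):
        return "wait_for_entry"
    if isinstance(reply, NotALeader):
        return "retry_on_leader"
    if isinstance(reply, CommitStatusUnknown):
        # Previously unhandled: assuming "not EntryId means NotALeader"
        # is what produced the bad_variant_access in the C++ code.
        return "report_commit_status_unknown"
    raise TypeError(f"unexpected reply type: {type(reply).__name__}")
```

Checking every alternative, and failing loudly on anything else, plays the same role as the comment and assertion added to `add_entry`.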
Scylla has a long-standing bug (issue #7620) where having many
tombstones in the schema table significantly slows down further
schema operations.
Many cql-pytest tests use new_test_table() to create a temporary test
table with a specific schema. Before this patch, each temporary table
was created with a random name, and deleted after the test. When
running many tests on the same Scylla server, this results in a lot
of tombstones in the schema tables, and really slow schema operations.
For example, look at how much time it takes to run the same test file
N times:
$ test/cql-pytest/run --count N test_filtering.py
N=25 - 16 seconds (total time for the N repetitions)
N=50 - 41 seconds
N=100 - 122 seconds
Notice how progressively slower each repetition is becoming - the
total test time should have been linear in N, but it isn't!
In this patch, we keep a cache of already-deleted table names (not the
tables, just their names!) so as to reuse the same name when we can
instead of inventing a new random name. With this patch, the performance
improvement after some repetitions is amazing (compare to the table above):
N=25 - 14 seconds
N=50 - 29 seconds
N=100 - 46 seconds
Note how the testing time is now more-or-less linear in the number of
repetitions, as expected.
The table-name recycling trick is the same trick I already used in the
past for the translated Cassandra tests (test/cql-pytest/cassandra_tests).
The problem was even more obvious there because those tests create a
lot of different tables. But the same problem also exists in cql-pytest
in general, so let's solve it here too.
Refs #7620
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10635
Consider:
- n1 and n2 in the cluster
- n3 bootstraps to join
- n1 does not hear gossip update from n3 due to network issue
- n1 removes n3 from gossip and pending node list
- stream between n1 and n3 fails
- n1 and n3 network issue is fixed
- n3 retries the stream with n1
- n3 finishes the stream with n1
- n3 advertises normal to join the cluster
The problem is that n1 will not treat n3 as a pending node, so writes
will not be routed to n3 once n1 removes n3.
Another problem is that when n1 gets the normal gossip status update from
n3, the gossip listener will fail because n1 has removed n3, so n1 cannot
find the host id for n3. This will cause n1 to abort.
To fix, disable the retry logic in range_streamer so that once a stream
with an existing node fails, the bootstrap fails.
The downside is that we lose the ability to restream after a temporary
network issue, but since we have repair-based node operations, we can use
them to resume previously failed node operations.
Fixes: #9805
Closes #9806
Currently we support queries like:
```cql
SELECT * FROM ks.tab WHERE p IN (1, 2, null, 4);
```
Nothing can be equal to null so this is equivalent to:
```cql
SELECT * FROM ks.tab WHERE p IN (1, 2, 4);
```
Cassandra doesn't support it at all.
```cql
> SELECT * FROM ks.tab WHERE p IN (1, 2, null, 4)
Error: DbError(Invalid, "Invalid null value in condition for column p")
> SELECT * FROM ks.tab WHERE p IN (1, 2, ?, 4) # ? is NULL
Error: DbError(Invalid, "Invalid null value in condition for column p")
> SELECT * FROM ks.tab WHERE p IN ? # ? is (1, 2, null, 4)
Error: DbError(Invalid, "Invalid null value in condition for column p")
```
It makes little sense to send a null inside list of IN values and supporting it is a bit cumbersome.
Supporting it causes trouble because internally the values are represented as a list, not a tuple, and lists can't contain nulls.
Because of that, the code needs a special case: this is the only situation in which a collection can contain a null.
This PR starts treating a list of IN values the same as any other list, and as a result nulls are forbidden inside them.
In case of a null the message is the same as any other collection:
```
null is not supported inside collections
```
I'm not entirely happy about it - someone could be confused if they received this message after a query that didn't involve any collections.
The problem with making a prettier error message is that once again we would have to give `evaluate` additional information that it's now evaluating a list of IN values. And we would end up back with `evaluate_IN_list`.
I think we could consider adding some kind of generic context to evaluate. The context would contain the whole expression and a mark on the part that we are currently evaluating. Then in case of an error we could use this context to create more helpful error messages, e.g. point to the part of the expression where a problem occurred. But that's outside of the scope of this PR.
Fixes #10579
Closes #10620
* github.com:scylladb/scylla:
cql: Add test for null in IN list
cql: Forbid null in lists of IN values
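The rule introduced by this series can be sketched as follows. This is a minimal Python illustration, not the C++ expression evaluator: `evaluate_in_values` and `InvalidRequest` are hypothetical names, and only the error text is taken from the message above.

```python
class InvalidRequest(Exception):
    """Hypothetical stand-in for the server's invalid-request error."""


def evaluate_in_values(values):
    """Validate a list of IN values the same way as any other list."""
    if any(v is None for v in values):
        raise InvalidRequest("null is not supported inside collections")
    return list(values)


accepted = evaluate_in_values([1, 2, 4])
try:
    evaluate_in_values([1, 2, None, 4])
    error = None
except InvalidRequest as e:
    error = str(e)
```

Because the check does not care whether the list was written inline or sent as a bound value, both forms of the query behave the same way.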
We used to allow nulls in lists of IN values,
i.e. a query like this would be valid:
SELECT * FROM tab WHERE pk IN (1, null, 2);
This is an old feature that isn't really used
and is already forbidden in Cassandra.
Additionally the current implementation
doesn't allow for nulls inside the list
if it's sent as a bound value.
So something like:
SELECT * FROM tab WHERE pk IN ?;
would throw an error if ? was (1, null, 2).
This is inconsistent.
Allowing it made writing code cumbersome because
this was the only case where having a null
inside of a collection was allowed.
Because of it there needed to be
separate code paths to handle regular lists
and lists of NULL values.
Forbidding it makes the code nicer and consistent
at the cost of a feature that isn't really
important.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
This patch set adds two commits to allow triggering off-strategy compaction early for node operations.
*) repair: Repair table by table internally
This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.
Before:
```
for range in ranges
    for table in tables
        repair(range, table)
```
After:
```
for table in tables
    for range in ranges
        repair(range, table)
```
The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This reduces the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.
This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.
Refs: #10462
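The 50 * 256 arithmetic above can be checked with a small simulation. This is a sketch under simplifying assumptions that are not part of the patch: each repaired (range, table) pair produces exactly one temporary sstable, and off-strategy compaction removes all of a table's temporary sstables from the count the moment it runs. The function name is hypothetical.

```python
def peak_temporary_sstables(num_tables, num_ranges, table_by_table):
    """Return the peak number of temporary sstables during one repair job."""
    on_disk = 0
    peak = 0
    if table_by_table:
        for _table in range(num_tables):
            for _range in range(num_ranges):
                on_disk += 1
                peak = max(peak, on_disk)
            # The table is finished, so off-strategy compaction can run now.
            on_disk -= num_ranges
    else:
        for _range in range(num_ranges):
            for _table in range(num_tables):
                on_disk += 1
                peak = max(peak, on_disk)
        # No table is finished until the very end of the job.
        on_disk = 0
    return peak


before = peak_temporary_sstables(50, 256, table_by_table=False)
after = peak_temporary_sstables(50, 256, table_by_table=True)
```

Under these assumptions the range-major order peaks at 12800 temporary sstables and the table-major order at 256, which is the 50X reduction quoted above.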
*) repair: Trigger off strategy compaction after all ranges of a table is repaired
When the repair reason is not repair, which means the repair reason is
node operations (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.
To trigger off strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.
Refs: #10462
Closes #10551
* github.com:scylladb/scylla:
repair: Trigger off strategy compaction after all ranges of a table is repaired
repair: Repair table by table internally
"
There are several issues with it
- it's scattered between main() and storage_service methods
- yet another incarnation of it also sits in the cql-test-env
- the prepare_to_join() and join_token_ring() names are lying to readers,
as sometimes the node joins the ring in the prepare- stage
- storage service has to carry several private fields to keep the state
between prepare- and join- parts
- some storage service dependencies are only needed to satisfy joining,
but since they cannot start early enough, they are pushed to storage
service uninitialized "in the hope" that it won't use them until join
This patch puts the joining steps in one place and relieves storage service
of carrying unneeded dependencies/state onboard. It also eliminates one more
usage of the global proxy instance while at it.
branch: https://github.com/xemul/scylla/tree/br-merge-init-server-and-join-cluster
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/466/
refs: #2795
"
* 'br-merge-init-server-and-join-cluster' of https://github.com/xemul/scylla:
storage_service: Remove global proxy call
storage_service: Remove sys_dist_ks from storage_service dependencies
storage_service: Remove cdc_gen_service from storage_service dependencies
storage_service: Make _cdc_gen_id local variable
storage_service: Make _bootstrap_tokens local variable
storage_service: Merge prepare- and join- private members
storage_service: Move some code up the file
storage_service: Coroutinize join_token_ring
storage_service: Fix indentation after previous patch
storage_service: Execute its .bootstrap() into async()
storage_service: Dont assume async context in mark_existing_views_as_built
storage_service: Merge init-server and join-cluster
main, storage_service: Move wait for gossip to settle
main, storage_service: Move passive announce subscription
main, storage_service: Move early group0 join call
An overload of storage_proxy::query_mutations_locally was declared in
a35136533d which takes a vector of
partition ranges as an argument, but it was never defined. This commit
removes the unused overload declaration.
Closes #10610
Since 9b49d27a8 ("cql3: expr: Remove shape_type from bind_variable"),
bind variables no longer remember their context (e.g. if they are
in a scalar or vector comparison, or if they are in an IN or
other relation). Exploit that by merging all of the productions that
generate bind variables (that are now exactly equal) into a single
marker production.
Closes #10624
Storage service needs it to calculate schema version on join. The proxy
at this point can be passed as an argument to the joining helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service in question is only needed at join_cluster time, no need to
keep it in the dependencies list. This also solves the dependency
trouble -- the distributed keyspace is sharded::start-ed after it's
passed to storage_service initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is only needed at join time, it's better to pass it as
argument to join_cluster(). This solves the current reversed dependency
issue -- the cdc_gen_svc is now started after it's passed to storage
service initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as with _bootstrap_tokens -- this variable is only needed
throughout a single function invocation, so it doesn't have to be a
class member.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's a member on storage_service, but it was such just to carry the
set of tokens between two subsequent calls. Now that all the joining
happens in one function, the set can become a local variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These two are the real code that does preparation and joining. They are
called in async() context by public storage_service methods that had
been merged recently, so this patch merges the internals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No logic change, this is to keep join_token_ring next to
prepare_to_join so that the patch merging them becomes clean and small.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patch will merge this method with prepare_to_join() which is
already coroutinized. To make it happen -- coroutinize it in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will coroutinize join_cluster(), so the .bootstrap() method
should return a future. It's worth coroutinizing it as well, but that's
a huge change, so for now -- keep it in its own explicit async().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they always follow one another both in main and cql-test-env.
Also, despite the name, init-server does join the cluster when it's
just a normal node restarting, so join-cluster is called when the
cluster is already joined. This merge makes the function be named for
what it really does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And configure cql-test-env to skip it so as not to slow down tests in
vain. Another side effect is that cql-test-env would trigger features
enabling at this point, but that's OK, they are enabled anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service already has a vector of random subscription scope
holders, this becomes yet another one. This partially reverts
e4f35e2139, which is a half-step backwards, but so far I have no better
idea of where to track that scope guard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It happens right after the prepare to join, moving it at the end of the
latter call doesn't change the code logic. A side effect -- this removes
a silly join_group0() one-line helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a user runs a script and presses control-C, a SIGINT (signal 2)
gets sent to every process in the script's "process group". By default,
every subprocess started by a script joins the parent's process group.
Our test/*/run test-runner scripts typically start two processes: scylla
and pytest. If we keep them in the same process group, a control-C
would kill them in a random order and that is ugly - if Scylla is
killed before pytest, we'll see a few test failures before pytest
is finally killed. So the existing code put Scylla in its own process
group, and killed it on exit after killing pytest.
But there were a few inconsistencies in our implementation, leading
to some annoying behaviors:
1. Doing "kill -2" to the runner's process (not a control-C which sends
a signal to the process group) caused scylla and pytest to be killed
on exit. So far so good. But, we should kill their entire process
groups, not just the one process. This is important when pytest starts
its own subprocesses (as happens in cql-pytest/test_tools.py),
otherwise they just remain running.
We need to call killpg() instead of kill(), but also we forgot
to start a new process group for the pytest run - so this patch
fixes it.
2. Our exit handler - which kills the subprocesses - only gets called
on signals which Python catches, and this is only SIGINT. Killing
the test runner with SIGTERM or SIGHUP before this patch caused
the subprocesses to be left running. In this patch we also catch
SIGTERM and SIGHUP, so our exit handler is also run in that case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10629
This small series includes a few more CQL tests in the cql-pytest
framework.
The main patch is a translation of a unit test from Cassandra that
checks the behavior of restrictions (WHERE expressions, filtering) in
different cases. It turns out that Cassandra didn't implement some
cases - for example filtering on unfrozen UDTs - but Scylla does
implement them. So in the translated test, the
checks-that-these-features-generate-an-error from Cassandra are
commented out, and this series also includes separate tests for these
Scylla-unique features to check that they actually work correctly and
not just that they exist.
Closes #10611
* github.com:scylladb/scylla:
cql-pytest: translate Cassandra's tests for relations
test/cql-pytest: add test for filtering UDTs
test/cql-pytest: tests for IN restrictions and filtering
test/cql-pytest: test more cases of overlapping restrictions
* abseil f70eadad...9e408e05 (109):
> Cord: workaround a GCC 12.1 bug that triggers a spurious warning
> Change workaround for MSVC bug regarding compile-time initialization to trigger from MSC_VER 1910 to 1930. 1929 is the last _MSC_VER for Visual Studio 2019.
> Don't default to the unscaled cycle clock on any Apple targets.
> Use SSE instructions for prefetch when __builtin_prefetch is unavailable
> Replace direct uses of __builtin_prefetch from SwissTable with the wrapper functions.
> Cast away an unused variable to play nice with -Wunused-but-set-variable.
> Use NullSafeStringView for const char* args to absl::StrCat, treating null pointers as "" Fixes #1167
> raw_logging: Extract the inlined no-hook-registered behavior for LogPrefixHook to a default implementation.
> absl: fix use-after-free in Mutex/CondVar
> absl: fix live-lock in CondVar
> Add a stress test for base_internal::ThreadIdentity reuse.
> Improve compiler errors for mismatched ParsedFormat inputs.
> Internal change
> Fix an msan warning in cord_ringbuffer_test
> Fix spelling error "charachter"
> Document that Consume(Prefix|Suffix)() don't modify the input on failure
> Fixes for C++20 support when not using std::optional.
> raw_logging: Document that AbortHook's buffers live for as long as the process remains alive.
> raw_logging: Rename SafeWriteToStderr to indicate what about it is safe (answer: it's async-signal-safe).
> Correct the comment about the probe sequence. It's (i/2 + i)/2 not (i/2 - i)/2.
> Improve analysis of the number of extra `==` operations, which was overly complicated, slightly incorrect.
> In btree, move rightmost_ into the CompressedTuple instead of root_.
> raw_logging: Rename LogPrefixHook to reflect the other half of it's job (filtering by severity).
> Don't construct/destroy object twice
> Rename function_ref_benchmark.cc into more generic function_type_benchmark.cc, add missing includes
> Fixed typo in `try_emplace` comment.
> Fix a typo in a comment.
> Adds ABSL_CONST_INIT to initializing declarations where it is missing
> Automated visibility attribute cleanup.
> Fix typo in absl/time/time.h
> Fix typo: "a the condition" -> "a condition".
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Fix build with uclibc-ng (#1145)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Replace the implementation of the Mix function in arm64 back to 128bit multiplication (#1094)
> Support for QNX (#1147)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Exclude unsupported x64 intrinsics from ARM64EC (#1135)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Add NetBSD support (#1121)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Some trivial OpenBSD-related fixes (#1113)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Add support of loongarch64 (#1110)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Disable ABSL_INTERNAL_ENABLE_FORMAT_CHECKER under VsCode/Intellisense (#1097)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> macos: support Apple Universal 2 builds (#1086)
> cmake: make `random_mocking_bit_gen` library public. (#1084)
> cmake: use target aliases from local Google Test checkout. (#1083)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> cmake: add ABSL_BUILD_TESTING option (#1057)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Fix googletest URL in CMakeLists.txt (#1062)
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Export of internal Abseil changes
> Fix Randen and PCG on Big Endian platforms (#1031)
> Export of internal Abseil changes
Closes #10630
The 'relation' production is self-contained, except for its interface
to the rest of the grammar where it appends to a vector of expressions
(that happens to represent a conjunction of relations). Make it stand-
alone by returning an expression, and move the responsibility for
appending to an expression vector to the whereClause production (later
we can make it build a conjunction expression rather than a vector
of expressions, paving the way for more boolean operators).
Closes #10623
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectSingleColumnRelationTest.java into our
cql-pytest framework.
This test file includes 23 tests for various types of SELECT operations
which involve relations, a.k.a expressions (i.e., WHERE).
All 23 tests pass on Cassandra. 3 of the tests fail on Scylla
reproducing 2 already known Scylla issues and three minor
previously-unknown issues:
Previously known issues:
Refs #2962: Collection column indexing
Refs #10358: Comparison with UNSET_VALUE should produce an error
Three new (and minor) issues:
Refs #10577: Is max-clustering-key-restrictions-per-query too low?
Refs #10631: Invalid IN restriction is reported as a '=' restriction
Refs #10632: Column name printed in a strange way in error message
NOTE: Scylla supports some expressions which Cassandra does not. In some
cases the Cassandra unit test had checks that certain constructs are not
allowed, and I had to comment out such checks when the expression *does*
work in Scylla. But of course, in such cases, it is not enough to comment
out a check - we also need to verify that Scylla's unique behavior
is the correct one. For that, we will have separate cql-pytest tests for
those features - they won't be in the translated Cassandra unit tests
(of course). For example, in this test I had to comment out a check
that filtering on *non*-frozen UDTs is not allowed. In a separate
patch which I'm sending in parallel, I added a new test -
test_filter_UDT_restriction_nonfrozen - which will verify that what
Scylla does in that case is the correct behavior.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Both Scylla and Cassandra support filtering on frozen UDTs, which are
compared using lexicographical order. This patch adds a test to verify
that the behavior here is the same - and indeed it is.
For *non*-frozen UDTs, Cassandra does not allow filtering on them (this
was decided in CASSANDRA-13247), but Scylla does. So we also add a test
on how non-frozen UDTs work - that passes on Scylla (and of course not
in Cassandra).
The two tests here - for frozen and non-frozen UDTs - are identical
(they just call the same function) - to ensure these two cases work the
same. This is important because we can't judge the correctness of the
non-frozen test by comparison to Cassandra - because it can't run there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Cassandra only allows IN restrictions on primary key columns.
With filtering (ALLOW FILTERING), this limitation makes little
sense, and it turns out that Scylla does not have this limitation.
So this patch adds a test that we support such queries *correctly*
(we can't compare to Cassandra because it doesn't implement this).
Another test checks IN restrictions on an *indexed* column
(without ALLOW FILTERING). We could have implemented this - the
indexed column behaves like a partition key - but we didn't,
so this test xfails. It also fails on Cassandra because as mentioned
above, Cassandra didn't implement IN except for primary key columns.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As noted by test_filtering.py::test_multiple_restrictions_on_same_column
Cassandra WHERE does not allow specifying two restrictions on the
same column, but Scylla does allow it, and this test verifies that the
results are correct (conflicting restrictions would lead to no results,
but overlapping restrictions can return some results).
In this patch we add yet another example of multiple restrictions
on the same column that was seen in a Cassandra unit test - this time
one of the restrictions involves an IN. This patch helps confirm that
the expression evaluation is done correctly (and, again, differently from
Cassandra - Cassandra results in an error in this case).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
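The difference between overlapping and conflicting restrictions on the same column can be sketched as follows. This is a plain Python illustration of the expected semantics (a conjunction of predicates), not the test from the patch: `select` and the sample values are hypothetical.

```python
def select(values, *restrictions):
    """Return the values that satisfy all restrictions (a conjunction)."""
    return [v for v in values if all(r(v) for r in restrictions)]


column = [1, 2, 3, 4, 5]

# Overlapping: "c IN (1, 2, 3) AND c >= 2" still matches some rows.
overlapping = select(column, lambda v: v in (1, 2, 3), lambda v: v >= 2)

# Conflicting: "c IN (1, 2) AND c > 3" matches nothing.
conflicting = select(column, lambda v: v in (1, 2), lambda v: v > 3)
```

Evaluating each restriction independently and intersecting the results is what makes the overlapping case return rows instead of an error.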
The replication happens on all shards but the current one. There's a special
helper in seastar for such cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All but azure really do so.
The azure one gets dc/rack on all shards, which is excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.
fixes: #10494
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Without that there's use-after-free when called from
distributed_loader::make_sstables_available where
func is turned into a coroutine and the shared_sstable parameter
is not explicitly copied and captured for the continuation
of sst->move_to_new_dir.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The handle must not point at this reader implementation after it's destroyed.
This fixes use-after-free when the queue_reader_v2
is destroyed first as repair_writer_impl::_queue_reader,
before repair_writer_impl::_mq is destroyed.
The issue was introduced in 39205917a8
in the definition of `repair_writer_impl`.
Fixes #10528
While at it, fix also an ignored exceptional future seen in the test:
`repair_additional_test.py::TestRepairAdditional::test_repair_kill_3`
Closes #10591
* github.com:scylladb/scylla:
mutation_readers: queue_reader_v2: detach from handle when destroyed
messaging_service: do_make_sink_source: handle failed source future
Off-strategy works on maintenance sstable set using maintenance
scheduling group, whereas "in-strategy" works on main sstable set
and uses compaction group.
Today, it can happen that off-strategy has to wait for an "in-strategy"
maintenance compaction, e.g. cleanup, to complete before getting
a chance to run. But that's not desired behavior as off-strategy uses
maintenance group, and its candidates don't add to the backlog that
influences "in-strategy" bandwidth. Therefore, "in-strategy" and
off-strategy should be decoupled, with off-strategy having its own
semaphore for guaranteeing serialization across tables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10595
To run scylla-housekeeping we currently use "sudo -u scylla <cmd>" to switch
to the scylla user, but it fails in some environments.
Since recent versions of Python 3 support switching the user in the subprocess
module, let's use the Python-native way and drop sudo.
Fixes #10483
Closes #10538
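A minimal sketch of the Python-native approach, assuming Python 3.9 or later on POSIX, where `subprocess.run` accepts a `user` argument. The helper name `run_as` is hypothetical and is not taken from the scylla scripts; the demo passes `user=None` so that it runs without privileges.

```python
import subprocess
import sys


def run_as(cmd, user=None):
    """Run cmd, optionally as another user (POSIX only, Python 3.9+).

    With user=None the child keeps the caller's identity. Switching to a
    different user requires sufficient privileges (typically root).
    """
    return subprocess.run(cmd, user=user, capture_output=True, text=True, check=True)


# Unprivileged demo: keep the current user. A privileged service script
# would pass user="scylla" here instead of wrapping the command in sudo.
result = run_as([sys.executable, "-c", "print('ok')"])
```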
Currently, the idl-generated deserialization code, e.g. mutation_partition_view::rows(), deserializes and returns a complete utils::chunked_vector<deletable_row_view>, which could be arbitrarily long.
To consume it gently, we don't need the whole vector in advance, but rather we can consume it one element at a time (and in a nested way for cells in a row in the future).
Use `range_deserializer` to consume range tombstones and rows one item at a time.
We may consider in the future also gently iterating over cells
in a row and then dipping into collection cells that might also
contain a large number of items.
Fixes #10558
Closes #10566
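The idea of consuming a serialized vector one element at a time, rather than materializing it first, can be sketched as a generator. The wire layout used here (a big-endian u32 count followed by length-prefixed items) and the names `iter_items` and `serialize` are assumptions made for illustration; they are not the idl format or the `vector_deserializer` interface.

```python
import struct


def iter_items(buf):
    """Yield the items of a serialized vector lazily, one at a time.

    Assumed layout: u32 count, then for each item a u32 length and payload.
    """
    (count,) = struct.unpack_from(">I", buf, 0)
    offset = 4
    for _ in range(count):
        (size,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        # Only the current item is copied out of the buffer.
        yield bytes(buf[offset:offset + size])
        offset += size


def serialize(items):
    """Build a buffer in the assumed layout (helper for the demo)."""
    out = struct.pack(">I", len(items))
    for item in items:
        out += struct.pack(">I", len(item)) + item
    return out


rows = [b"row-1", b"row-22", b""]
consumed = [item for item in iter_items(serialize(rows))]
```

The caller can stop iterating, or yield to other work between items, without ever holding the whole vector in memory.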
* github.com:scylladb/scylla:
ser: use vector_deserializer by default for all idl vectors
mutation_partition_view: do_accept_gently: use the range based deserializers
idl-compiler: generate *_range methods using vector_deserializer
serializer_impl: add vector_deserializer
test: frozen_mutation_test: test_writing_and_reading_gently: log detailed error
This series is split from another, bigger RFC series which provides
manual remedies to deal with inconsistencies between the base table
and its views. This part deals with ghost rows by providing a statement
which fetches view rows from a given range, then reads its corresponding
rows from the base table (cl=ALL), and finally removes rows which were
not present in the base table at all, qualifying them as ghost rows.
Motivations for introducing such a statement:
* in case of detected inconsistencies, it can be used to fix
materialized views without recreating them from scratch, which can
take days and generates lots of throughput
* a tool which periodically scrubs a materialized view can be easily
created on top of this statement, especially since it's possible
to remove ghost rows from a user-defined view token range.
This series comes with a unit test.
The reason for digging up this series is because it's still possible to end up with ghost rows in certain rather improbable scenarios, and we lack a way of fixing them without rebuilding the whole view. For instance, in case of a failed synchronous update to a local view, the user will be notified that the query failed, but a ghost row can be created nonetheless. The pruning statement introduced in this series would allow healing the failure locally, without rebuilding the whole view.
Tests: unit(dev)
Closes #10426
* github.com:scylladb/scylla:
docs: add a paragraph on PRUNE MATERIALIZED VIEW statement
service,test: add a test case for error during pruning
tests: add ghost row deletion test case
cql3: enable ghost row deletion via CQL
cql3: add a statement for deleting ghost rows
cql3: convert is_json statement parameter to enum
pager: add ghost row deleting pager
db,view: add delete ghost rows visitor
Writing into the group0 raft group on a client side involves locking
the state machine, choosing a state id and checking for its presence
after operation completes. The code that does it resides now in the
migration manager since it is currently the only user of group0. In
the near future we will have more clients for group0 and they all will
have to have the same logic, so the patch moves it to a separate class
raft_group0_client that any future user of group0 can use to write
into it.
Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>
"
There's an explicit barrier in main that waits for the sstable format
selector to finish selecting it by the time the node starts to join a cluster.
(Actually -- not quite, when restarting a normal node it joins cluster
in prepare_to_join()).
This explicit barrier is not needed, the sync point already exists in
the way features are enabled, the format-selector just needs to use it.
branch: https://github.com/xemul/scylla/tree/br-format-selector-sync
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/351/
refs: #2795
"
* 'br-format-selector-sync' of https://github.com/xemul/scylla:
format-selector: Remove .sync() point
format-selector: Coroutinize maybe_select_format()
format-selector: Coroutinize simple methods
This series adds two commands:
* scylla sstable-summary
* scylla sstable-index-cache
The former dumps the content of the sstable summary. This component is kept in memory in its entirety, so this can be easily done. The latter command dumps the content of the sstable index cache. This contains all the index pages that are currently cached. The promoted index is not dumped yet and there is no indication of whether a given entry is in the LRU or not, but this already allows seeing which pages are in the cache and which aren't.
Closes #10546
* github.com:scylladb/scylla:
scylla-gdb.py: add scylla sstable-index-cache command
scylla-gdb.py: add scylla sstable-summary command
test/scylla-gdb: add sstable fixture
scylla-gdb.py: make chunked_vector a proper container wrapper"
scylla-gdb.py: make small_vector a proper container wrapper"
scylla-gdb.py: add sstring container wrapper
scylla-gdb.py: add chunked_managed_vector container wrapper
scylla-gdb.py: add managed_vector container wrapper
scylla-gdb.py: std_variant: add workaround for clang template bug
scylla-gdb.py: add bplus_tree container wrapper
The handle must not point at this reader implementation
after it's destroyed.
This fixes use-after-free when the queue_reader_v2
is destroyed first as repair_writer_impl::_queue_reader,
before repair_writer_impl::_mq is destroyed.
The issue was introduced in 39205917a8
in the definition of `repair_writer_impl`.
Fixes #10528
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have a check for whether we can use standard coroutines (in namespace
std) or the technical specification (in std::experimental), but it doesn't
work since Clang doesn't report the correct standard version. Use a
compiler-version-specific check, inspired by Seastar's check.
This allows building with clang 14.
Closes #10603
Let's say we have a query like:
```cql
INSERT INTO ks.t (list_column) VALUES (?);
```
And the driver sends a list with null inside as the bound value, something like `[1, 2, null, 4]`.
In such case we should throw `invalid_request_exception` because `nulls` are not allowed inside collections.
Currently when a query like this gets executed Scylla throws an ugly marshalling error.
This is because the validation code reads size of the next element, interprets it as an unsigned integer and tries to read this much.
In case of `null` element the size is `-1`, which when converted to unsigned `size_t` gives 18446744073709551615 and it fails to read this much.
This PR adds proper validation checks to make the error message better.
I also added some tests.
I originally tried to write them in python, but python driver really doesn't like sending invalid values.
Trying to send `[1, None, 2]` results in a list with empty value instead of null.
Trying to send `[1, UNSET_VALUE, 2]` fails before the query even leaves the driver.
Fixes #10580
Closes #10599
* github.com:scylladb/scylla:
cql3: Add tests for null and unset inside collections
cql3: Add null and unset checks in collection validation
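The arithmetic behind the old marshalling error, and the kind of check that replaces it, can be sketched as follows. The layout assumed here (a big-endian i32 element count, then an i32 length and payload per element, with length -1 standing for null) and the names `as_size_t`, `InvalidRequest`, and `validate_list` are illustrative assumptions, not Scylla's validation code.

```python
import struct


def as_size_t(n):
    """Reinterpret a signed length as a 64-bit unsigned size_t."""
    return n % (1 << 64)


class InvalidRequest(Exception):
    """Stand-in for invalid_request_exception."""


def validate_list(buf):
    """Reject null elements up front instead of trying to read a huge size."""
    (count,) = struct.unpack_from(">i", buf, 0)
    offset = 4
    for i in range(count):
        (size,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        if size < 0:
            raise InvalidRequest(f"null is not allowed inside collections (element {i})")
        if offset + size > len(buf):
            raise InvalidRequest("not enough bytes to read the element")
        offset += size


def pack_list(elements):
    """Serialize ints (or None for null) in the assumed layout."""
    out = struct.pack(">i", len(elements))
    for e in elements:
        out += struct.pack(">i", -1) if e is None else struct.pack(">ii", 4, e)
    return out


try:
    validate_list(pack_list([1, 2, None, 4]))
    error = None
except InvalidRequest as exc:
    error = str(exc)
```

Here `as_size_t(-1)` evaluates to 18446744073709551615, which is the size the unchecked code attempted to read.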
The test checks whether manually injected ghost rows are properly deleted
by the ghost row delete statement, and that non-ghost regular rows
are left intact.
This commit allows accepting a CQL request to clear ghost rows
from a given view partition range. Currently its syntax is a purposely
convoluted mix of existing keywords, which makes sure that the statement
is never issued by mistake. Example runs:
-- try deleting all ghost rows, effectively performs a paged full scan
PRUNE MATERIALIZED VIEW my_mv;
-- try deleting ghost rows from a single view partition
PRUNE MATERIALIZED VIEW my_mv WHERE mv_pk = 3;
-- try deleting ghost rows from a token range (effective full scans)
PRUNE MATERIALIZED VIEW my_mv WHERE TOKEN(mv_pk) > 7 AND TOKEN(mv_pk) < 42
In order to expose the API for deleting ghost rows from a view,
a CQL statement is created. It is loosely based on select_statement,
as its first step is to select view table rows.
Right now is_json is used to decide if the statement needs to be treated
in a special way. For two types (regular statement and JSON statement),
a boolean is enough, but this series extends it for two more types,
so the flag is converted to an enum.
The visitor is used to traverse view rows, and if it detects
a ghost row it qualifies it for deletion. Qualification is based
on a base table read with cl=ALL: if the corresponding row is not
present in the base table, it is considered a ghost.
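The qualification rule can be sketched with plain dictionaries. This is a toy model, assuming each view row carries the primary key of its base row; `find_ghost_rows` and the data layout are hypothetical, and the cl=ALL base read is modelled as a simple dictionary lookup.

```python
def find_ghost_rows(view_rows, base_rows, base_key_of):
    """Return view rows whose corresponding base row does not exist.

    view_rows:   iterable of view rows
    base_rows:   mapping from base primary key to base row (the "cl=ALL read")
    base_key_of: function extracting the base primary key from a view row
    """
    return [row for row in view_rows if base_key_of(row) not in base_rows]


# Base table keyed by pk; the view is keyed by (v, pk).
base = {1: {"v": 10}, 2: {"v": 20}}
view = [(10, 1), (20, 2), (30, 3)]  # (30, 3) has no base row with pk == 3

ghosts = find_ghost_rows(view, base, base_key_of=lambda row: row[1])
```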
Add a bunch of tests that test what happens
when there is a null or unset value inside collections.
They are not allowed, so every such attempt
should end with invalid_request_exception
with a proper message.
I had to write a new function for collection serialization.
I tried to use data_value and its methods, but it's impossible
to create a data_value that represents an unset value.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Currently use the range_deserializer for range tombstones and rows.
We may consider in the future also gently iterating over cells
in a row and then dipping into collection cells that might also
contain a large number of items.
Fixes #10558
Test: frozen_mutation_test(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Generate code for *_range methods that return a
vector_deserializer rather than constructing the complete
vector of views.
This would be useful for streamed mutation unfreezing
in the following patch.
Later, we should just use vector_deserializer for all vectors.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for streaming through a serialized
vector, deserializing the items as we go
when dereferencing or incrementing the iterator.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reapply: "disable_auto_compaction: stop ongoing compactions"
This is a reapplication of a former commit
4affa801a5 which was reverted by
8e8dc2c930.
This commit is a fixed version of the original, where a call to the
compaction_manager constructor was accidentally issued (`compaction_manager()`)
instead of a call to retrieve a compaction manager reference
(`get_compaction_manager()`). We don't use this function because it
doesn't exist anymore - it existed at the time the patch was written
but was removed in 9066224cf4 later on.
Instead, we just use the private table member _compaction_manager, which refs
the compaction manager.
The explanation for the bad effect is probably that a `this` pointer
capture down the call chain, resulted in a use after free which had
an unknown effect on the system. (memory corruption at startup).
Test: unit (dev,debug)
write performance test as the one used to find the bug.
A screenshot of the performance test can be found at
https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381
Fixes https://github.com/scylladb/scylla/issues/9313
Refs https://github.com/scylladb/scylla/issues/10146
For completeness, the original commit message was:
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.
Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compaction jobs are
running when nodetool disableautocompaction returns successfully.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #10597
* github.com:scylladb/scylla:
compaction_manager: Make invoking the empty constructor more explicit
Reapply: "disable_auto_compaction: stop ongoing compactions"
The compaction manager's empty constructor is supposed to be invoked
only in a testing environment; however, it is easy to invoke it by mistake
from production code.
Here we add a more verbose constructor and make the default constructor
private. The verbose constructor needs to be invoked with a tag,
for_testing_tag, which ensures that this constructor is invoked
only when intended.
The unit tests were changed according to this new paradigm.
Tests: unit (dev)
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
This is a reapplication of a former commit
4affa801a5 which was reverted by
8e8dc2c930.
This commit is a fixed version of the original, where a call to the
compaction_manager constructor was accidentally issued (`compaction_manager()`)
instead of a call to retrieve a compaction manager reference
(`get_compaction_manager()`). We don't use this function because it
doesn't exist anymore - it existed at the time the patch was written
but was removed in 9066224cf4 later on.
Instead, we just use the private table member _compaction_manager, which refs
the compaction manager.
The explanation for the bad effect is probably that a `this` pointer
capture down the call chain, resulted in a use after free which had
an unknown effect on the system. (memory corruption at startup).
Test: unit (dev,debug)
write performance test as the one used to find the bug.
A screenshot of the performance test can be found at
https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381
Fixes #9313
Refs #10146
For completeness, the original commit message was:
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.
Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compaction jobs are
running when nodetool disableautocompaction returns successfully.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Validating a collection should ensure that there
are no null or unset values inside the collection.
The validation already fails in case of such values,
but it does so in an ugly way.
Length of null and unset value is negative but is
cast to unsigned size_t. Then it tries to read
a really large value and fails with marshalling error.
The new checks are a better way to handle this.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
This test started failing sporadically of late. This failure is seen
quite often in CI tests but is very hard to reproduce locally. The
problem seems to be timing related, as the same seeds that fail in CI
don't fail locally. This patch is a speculative fix. The test has a
single time-related component: `gc_clock::now()`. This is invoked in 4
different places during a single iteration, giving ample opportunity for
off-by-one errors to appear. Although there is no solid proof for this
being the problem, this is a good candidate. This patch replaces all
those different invocations, with a single one per test: this value is
then propagated to all places that need it.
Fixes: #10554
Marking the patch as a fix for the issue; if the problem re-surfaces
after this patch we'll re-open it.
Closes #10589
Using the traceback_with_variables module, generate a more detailed traceback
with variables into the debug log.
This will help fix bugs which are hard to reproduce.
Closes #10472
[avi: regenerate frozen toolchain]
Today, aborting a maintenance compaction like major, which is waiting for
its turn to run, can take lots of time because compaction manager will
only be able to bail out the task once it gets the "permit" from the
serialization mechanism, i.e. semaphore. Meaning that the command that
started the task will only complete after all this time waiting for
the "permit".
To allow a pending maintenance compaction to be quickly aborted, we
can use the abortable variant of get_units(). So when user submits an
abortion request, get_units() will be able to return earlier through
the abort exception.
Refs #10485.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10581
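As an analogy only, the effect of an abortable wait can be shown with Python's asyncio, where a task blocked on a semaphore can be cancelled without first obtaining the permit. This is not Seastar's API; `maintenance_task` and `main` are hypothetical names, and task cancellation stands in for the abort request.

```python
import asyncio


async def maintenance_task(sem, log):
    try:
        async with sem:  # waits here for the "permit"
            log.append("ran")
    except asyncio.CancelledError:
        # The wait itself is interrupted; no permit was ever taken.
        log.append("aborted while waiting")
        raise


async def main():
    sem = asyncio.Semaphore(1)
    log = []
    await sem.acquire()  # another compaction currently holds the permit
    task = asyncio.create_task(maintenance_task(sem, log))
    await asyncio.sleep(0)  # let the task start waiting on the semaphore
    task.cancel()  # the user's abort request
    try:
        await task
    except asyncio.CancelledError:
        pass
    sem.release()
    return log


log = asyncio.run(main())
```

The task finishes as soon as the abort arrives, even though the permit holder has not released the semaphore yet.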
Before this change the parser used to output instances of the `relation` class, which were later converted to `restriction`.
`relation` took care of initial processing such as preparing and some validation checks.
This PR aims to remove the `relation` class and perform its functionality using only `expression`.
This is a step towards removing the legacy classes and converting all AST analysis to work on `expressions`.
Closes #10409
* github.com:scylladb/scylla:
cql3: Remove scalar from bind_variable_scalar_prepare_expression
cql3: expr: Remove shape_type from bind_variable
cql3: Remove prepare_expression_multi_column
cql3: Remove relation class
cql3: Add more tests for expr::printer
cql3: Make parser output expression for relations
cql3: expr: add printer for expression
cql3: expr: expr::to_restriction: Handle token relations
cql3: expr: expr::to_restriction: Handle multi column relations
cql3: expr: Add expr::to_restriction for single column relations
cql3: expr: Add prepare_binary_operator
cql3: expr: Change how prepare_expression handles bind_variable
clq3: expr: Add columns to expr::token struct
cql3: expr: Modify list_prepare_expression to handle lists of IN values
cql3: expr: Add expr::as_if for non-const expressions
Currently our error message in scylla_prepare says "Exception occurred
while creating perftune.yaml", even when perftune.yaml has already been
generated and the error occurred after that.
To describe the error more accurately, add another error message for
failures that happen after perftune.yaml is generated.
see scylladb/scylla-enterprise#2201
Closes #10575
shape_type was used in prepare_expression to differentiate
between a few cases and create the correct receivers.
This was used by the relation class.
Now creating the correct receiver has been delegated to the caller
of prepare_expression and all bind_variables can be handled
in the same simple way.
shape_type is not needed anymore.
Not having it is better because it simplifies things.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
This function was used by multi_column_relation.hh,
but now it isn't needed anymore.
The only way to prepare a bind_variable is now the standard prepare_expression.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Functionality of the relation class has been replaced by
expr::to_restriction.
Relation and all classes deriving from it can now be removed.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Now that parser outputs expressions
it's much easier to check whether
expression printer works correctly.
We can prepare a bunch of strings
which will be parsed and then printed
back to string.
Then we can compare those strings.
It's much easier than creating
expressions to print manually.
The only downside is that this tests
only the unprepared version of an expression,
so instead of column_value there will
be unresolved identifier, instead of constant
untyped_constant etc.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Parser used to output the where clause as a vector of relations,
but now we can change it to a vector of expressions.
Cql.g needs to be modified to output expressions instead
of relations.
The WHERE clause is kept in a few places in the code that
need to be changed to vector<expression>.
Finally relation->to_restriction is replaced by expr::to_restriction
and the expressions are converted to restrictions where required.
The relation class isn't used anywhere now and can be removed.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
expression::printer is used to print CQL expressions
in a pretty way that allows them to be parsed back
to the same representation.
There are a bunch of things that need to be changed when
compared to the current implementation of operator<<(expression)
to output something parsable.
Column names should be printed without 'unresolved_identifier()'
and sometimes they need to be quoted to preserve case sensitivity.
I needed to write new code for printing constant values
because the current one did debug printing
(e.g. a set was printed as '1; 2; 3').
A list of IN values should be printed inside () instead of [],
but because it is internally represented as a list it is
by default printed with [].
To fix this a temporary tuple_constructor is created and printed.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Implement converting token relations to expressions.
The code is mostly taken from functions in token_relation.hh,
because we are replicating functionality of the functions called
token_relation::new_XX_restrictions.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Implement converting multi column relations to expressions.
The code is mostly taken from functions in multi_column_relation.hh,
because we are replicating functionality of the functions called
multi_column_relation::new_XX_restriction.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Add a function that will be used to convert expressions
received from the parser to restrictions.
Currently parser creates relations with expressions inside
and then those relations are converted to restrictions.
Once this function is implemented we will be able to skip
creating relations altogether and convert straight from
expression to restriction. This will allow us to remove
the relation class.
Further functionality will be implemented in the following commits.
This commit implements converting single column relations to expressions.
The code is mostly taken from functions in single_column_relation.hh,
because we are replicating functionality of the functions called
single_column_relation::new_XX_restriction.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Add a function that allows to prepare
a binary_operator received from the parser.
It resolves columns on the LHS, calculates type of LHS,
and prepares RHS with the correct type.
It will be used by expr::to_restriction.
Some basic type checks are performed, but more thorough
checks will be required in expr::to_restriction to fully
validate a relation.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
The situation with preparing bind_variable is a bit strange,
there are four shapes of bind variables and receiver behaviour
is not in line with other types.
To prepare a bind_variable for a list of IN values for an int column
the current code requires us to pass a receiver of type int.
This is counterintuitive, to prepare a string we pass
a receiver with string type, so to prepare list<int> we should
pass a receiver of type list<int>, not just int.
This commit changes the behaviour in two ways:
- Shape of bind_variable doesn't matter anymore
- The bind_variable gets the receiver passed to prepare_expression,
no more list<receiver> magic.
Other variants of bind_variable_x_prepare_expression are not removed yet
because they are needed by prepare_expression_multi_column.
They will be removed later, along with bind_variable::shape_type.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
The expr::token struct is created when something
like token(p1, p2) occurs in the WHERE clause.
Currently expr::token doesn't keep columns passed
as arguments to the token function.
They weren't needed because token() validation
was done inside token_relation.
Now that we want to use only expressions
we need to have columns inside the token struct
and validate that those are the correct columns.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
The standard CQL list type doesn't allow for nulls inside the collection.
However, lists of IN values are the exception where nulls are allowed,
for example in restrictions like: p IN (1, 2, null)
To be able to use list_prepare_expression with lists of IN values
a flag is added to specify whether nulls should be allowed.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
expr::as_if is our wrapper for std::get_if.
There was a version for const expression*,
but there wasn't one for mutable expression*.
Add the mutable version,
it will be needed in the following commits.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
The feature listener callbacks are waited upon to finish in the
middle of the cluster joining process. In particular -- before
actually joining the cluster the format should have been selected.
For that there's a .sync() method that locks the semaphore thus
making sure that any update is finished and it's called right after
the wait_for_gossip_to_settle() finishes.
However, features are enabled inside the wait_for_gossip_to_settle()
in a seastar::async() context that's also waited upon to finish. This
waiting makes it possible for any feature listener to .get() any of
its futures that should be resolved until gossip is settled.
That said, the format selection barrier can be moved -- instead of
waiting on the semaphore, the respective part of the selection code
can be .get()-ed (it all runs in async context). One thing to care
about -- the remainder should continue running with the gate held.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is run when a feature is enabled. It's a bit trickier than
the others, also there are two methods actually, that are merged into
one by this patch. By and large most of the care is about the _sel
gate and _sem semaphore.
The gate protects the whole selection code from the selector being freed
from underneath it on stop. The semaphore is only needed to keep two
different format selections apart from each other -- each updates the system
keyspace, local variable and replica::database instance on all shards.
In the end there's a gossiper update, but it happens outside of the
semaphore.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Our documentation for SELECT,
https://docs.scylladb.com/getting-started/dml/#ordering-results
says that:
"The ORDER BY clause lets you select the order of the returned results.
It takes as argument a list of column names along with the order for
the column (ASC for ascendant and DESC for descendant,
**omitting the order being equivalent to ASC**)."
The test in this patch confirms that the last emphasized line is not
accurate - The default order for SELECT is the default order of the table
being read - NOT always ascending order. If the table was created with
descending WITH CLUSTERING ORDER BY, then a SELECT not specifying
an ORDER BY will get this descending order by default.
The test passes on both Scylla and Cassandra, demonstrating that this
behavior is expected and correct - regardless of what our docs say.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220515115030.775813-1-nyh@scylladb.com>
When split from service/storage_service.hh (4ca2ae13) it accidentally
changed license. Change it back (since it does not contain
Apache derived code, constrain it to AGPL-3.0-or-later).
Closes #10572
The bytes_ostream param is passed by value from `write_promoted_index`
(since 0d8463aba5)
causing an unneeded copy.
This can lead to OOM if the promoted index is extremely large.
Pass the bytes_ostream by reference instead to prevent this copy.
Fixes #10569
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10570
Current code already assumes (correctly), that shrink() does not
throw, otherwise we risk leaking memory allocated in get_ptr():
```
ts_value_lru_entry* new_lru_entry = Alloc().template allocate_object<ts_value_lru_entry>();
// Remove the least recently used items if map is too big.
shrink();
```
Let's be explicit and mark shrink() and a few helper methods
that it uses as noexcept. Ultimately they are all noexcept anyway,
because polymorphic allocator's deallocation routines don't throw,
and neither do boost intrusive list iterators.
Closes #10565
It turns out that there is a difference between how Scylla and Cassandra
handle multiple restrictions on the same column - for example "WHERE
c = 0 AND c > 0". Cassandra treats all such cases as invalid queries,
whereas Scylla *allows* them.
This test demonstrates this difference (it is marked "scylla_only"
because it's a Scylla-only feature), and also verifies that the results
of such queries on Scylla are correct - i.e., if the two restrictions
conflict the result is empty, and if the two restrictions overlap,
the result can be non-empty.
The test passes, verifying that although Scylla differs from Cassandra
on this, its behavior is correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220512165107.644932-1-nyh@scylladb.com>
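The expected semantics, where every restriction on the column must hold for a row to be returned, can be modelled in a few lines. This is a toy evaluator over a list of clustering values, not Scylla's expression code; `select` and `OPS` are hypothetical names.

```python
import operator

# Comparison operators that may appear in a restriction on column c.
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def select(values, restrictions):
    """Return the values of c that satisfy all restrictions (logical AND)."""
    return [c for c in values if all(OPS[op](c, bound) for op, bound in restrictions)]


values = [0, 1, 2, 3]
conflicting = select(values, [("=", 0), (">", 0)])   # WHERE c = 0 AND c > 0
overlapping = select(values, [(">=", 1), ("<", 3)])  # WHERE c >= 1 AND c < 3
```

Conflicting restrictions yield an empty result, while overlapping ones yield the intersection.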
Recently Abseil started to ask to enable ABSL_PROPAGATE_CXX_STD,
warning that it will do so itself in the future. Do so, and
specify that we use C++20 to avoid inconsistencies.
Closes #10563
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes #10562
"Almost" because 2 uses of the v1 asserter remain (as they are deliberate).
Closes #10518
* github.com:scylladb/scylla:
tests: remove obsolete utility functions
tests: less trivial flat_reader_assertions{,_v2} conversions
tests: trivial flat_reader_assertions{,_v2} conversions
flat_mutation_reader_assertions_v2: improve range tombstone support
This reverts commit 4affa801a5.
In issue #10146 a write throughput drop of ~50% was reported, after
bisect it was found that the change that caused it was adding some
code to the table::disable_auto_compaction which stops ongoing
compactions and returning a future that resolves once all the compaction
tasks for a table, if any, were terminated. It turns out that this function
is used only at startup (and in REST api calls which are not used in the test)
in the distributed loader just before resharding and loading of
the sstable data. It is then reenabled after the resharding and loading
is done.
For still unknown reason, adding the extra logic of stopping ongoing
compactions made the write throughput drop to 50%.
Strangely enough this extra logic **should** (still unvalidated) not
have any side effects since no compactions for a table are supposed to
be running prior to loading it.
This regains the performance but also undoes a change which eventually
should get in once we find the actual culprit.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #10559
Reopens #9313.
We're hitting a unit test failure as in
https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/1010/artifact/testlog/aarch64_dev/frozen_mutation_test.test_writing_and_reading_gently.918.log
```
unknown location(0): fatal error: in "test_writing_and_reading_gently": std::_Nested_exception<std::runtime_error>: frozen_mutation::unfreeze_gently(): failed unfreezing mutation pk{00801806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000} of ks.cf
```
on aarch64 clang 12.0.1, in release and dev modes (but not debug).
This turned out to be a miscompilation
in `position_in_partition_view::for_key(cr.key())`
that returns a position_in_partition_view of the
clustering_key_prefix rvalue that cr.key() returns.
The latter is lost on aarch64 in release mode.
Keeping the key on the stack allows to safely pass
a view to it.
Fixes #10555
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The main reason for adding rust dependency to scylla is the
wasmtime library, which is written in rust. Although there
exist c++ bindings, they don't expose all of its features,
so we want to do that ourselves using rust's cxx.
The patch also includes an example rust source to be used in
c++, and its example use in tests/boost/rust_test.
The usage of wasmtime has been slightly modified to avoid
duplicate symbol errors, but as a result of adding a Rust
dependency, it is going to be removed from `configure.py`
completely anyway.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #10341
* github.com:scylladb/scylla:
docs: document rust
tests: add rust example
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.
(Would info be more appropriate here than warning?)
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10556
When the repair reason is not repair, which means the repair reason is
node operations (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.
To trigger off-strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.
Refs: #10462
This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.
Before:
```
for range in ranges
    for table in tables
        repair(range, table)
```
After:
```
for table in tables
    for range in ranges
        repair(range, table)
```
The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This reduces the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.
This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.
Refs: #10462
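The arithmetic in the example above can be checked with a small model. This is only a sketch: it assumes each repaired range produces exactly one temporary sstable and that off-strategy compaction removes all temporary sstables it covers; `peak_sstables` and its parameters are hypothetical names, not Scylla code.

```python
def peak_sstables(num_tables, num_ranges, compact_per_table):
    """Return the peak number of temporary repair sstables on disk.

    compact_per_table=False models the old behavior, where off-strategy
    compaction only triggers once the whole repair job is done.
    compact_per_table=True models the new behavior, where it triggers as
    soon as a table is finished.
    """
    on_disk = 0
    peak = 0
    for _table in range(num_tables):
        for _range in range(num_ranges):
            on_disk += 1  # assumption: one sstable per repaired range
            peak = max(peak, on_disk)
        if compact_per_table:
            on_disk = 0  # the finished table's temporary sstables are compacted
    return peak


before = peak_sstables(50, 256, compact_per_table=False)
after = peak_sstables(50, 256, compact_per_table=True)
```

With 50 tables and 256 ranges the model gives 12800 temporary sstables before the change and 256 after it, which is the 50X reduction mentioned above.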
Using Rust in Scylla is not intuitive, so the doc explains the entire
process of adding new Rust source files to Scylla. What happens
during compilation is also explained.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The patch includes an example rust source to be used in
c++, and its example use in tests/boost/rust_test.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
This series updates debugging.md with:
- add an example .gdbinit file
- update recommendation for finding the relocatable packages using a build-id on http://backtrace.scylladb.com/
Closes #10492
* github.com:scylladb/scylla:
docs: debugging.md: update instructions regarding backtrace.scylladb.com
docs: debugging.md: add a sample gdbinit file
Drop the section about non-relocatable packages. They are not a thing anymore.
Also tweaked the instructions for launching the toolchain container.
Closes#10539
* github.com:scylladb/scylla:
docs/debugging.md: adjust instructions for using the toolchain
docs/debugging.md: drop section about handling binaries from non-relocatable packages
It seems that 59adf05 has a bug: the regex pattern only handles the cpuset
pattern of the first 32 CPUs and ignores the rest.
We should extend the regex pattern to handle all CPUs.
Fixes #10523
Closes #10524
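To illustrate the kind of bug described above, here is a sketch with hypothetical patterns (the real ones are in the script touched by 59adf05 and are not shown in this message). It assumes the cpuset is written as comma-separated 32-bit hex words with the most significant word first; that format is an assumption made for the example.

```python
import re

# Hypothetical pattern that only accepts a single 32-bit word (first 32 CPUs).
FIRST_WORD_ONLY = re.compile(r"^0x[0-9a-f]{1,8}$")
# Hypothetical extended pattern that accepts any number of 32-bit words.
ALL_WORDS = re.compile(r"^0x[0-9a-f]{1,8}(?:,0x[0-9a-f]{1,8})*$")


def cpus_in_mask(mask):
    """Return the sorted CPU indices set in a comma-separated hex cpuset mask."""
    if not ALL_WORDS.match(mask):
        raise ValueError(f"unrecognized cpuset mask: {mask!r}")
    cpus = []
    # Assumption: the most significant word comes first, so walk in reverse.
    for word_index, word in enumerate(reversed(mask.split(","))):
        value = int(word, 16)
        for bit in range(32):
            if value & (1 << bit):
                cpus.append(word_index * 32 + bit)
    return cpus


wide_mask = "0x00000001,0x00000003"  # CPUs 0, 1 and 32
```

A mask such as `wide_mask` is rejected by the single-word pattern but parsed in full by the extended one.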
to Scylla itself to make it still compile - see below
* seastar 5e863627...96bb3a1b (18):
> install-dependencies: add rocky as a supported distro
> circleci: relax docker limits to allow running with new toolchain
> core: memory: Add memory::free_memory() also in Debug mode
> build: bump up zlib to 1.2.12
> cmake: add FindValgrind.cmake
> Merge 'seastar-addr2line: support sct syslogs' from Benny Halevy
> rpc: lower log level for 'failed to connect' errors
> scripts: Build validation
> perftune.py: remove rx_queue_count from mode condition.
> memory: add attributes to memalign for compatibility with glibc 2.35
> condition-variable: Fix timeout "when" potentially not killing timer
> Merge "tests: perf: measure coroutines performance" from Benny
> Merge: Refine COUNTER metrics
> Revert "Merge: Refine COUNTER metrics"
> reactor: document intentional bitwise-on-bool op in smp_pollfn::poll()
> Merge: Refine COUNTER metrics
> SLES: additionally check irqbalance.service under /usr/lib
> rpc_tester: job_cpu: mark virtual methods override
Changes to Scylla also included in this merge:
1. api: Don't export DERIVEs (Pavel Emelyanov)
Newer seastar doesn't have DERIVE metrics, but does have REAL_COUNTER
one. Teach the collectd getter the change.
(for the record: I don't understand how this endpoint works at all,
there's a HISTOGRAM metric out there that we would attempt to
expose with the v.ui() call, which is totally wrong)
2. test: use linux_perf_events.{cc,hh} from Seastar
Seastar now has linux_perf_events.{cc,hh}. Remove Scylla's version
of the same files and use Seastar's. Without this change, Scylla
fails to compile when some source files end up including both
versions and seeing double definitions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
No functional changes intended - this series is quite verbose,
but after it's in, it should be considerably easier to change
the type of SSTable generations to something else - e.g. a string
or timeUUID.
Closes #10533
Commit e8f3d7dd13 added eof() checks to public partition-level
advance_to() methods, to ensure we do not attempt to re-read the last
page of the index when at eof(). It was noted however that this check
would be safer in advance_to(index_bound&, dht::ring_position_view)
because that is the method that all these higher-level methods end up
calling. Placing the check there would guarantee safety for all such
operations. This patch does exactly that: it pushes down the check to
said method. One change needed for this to work is to check eof on the
bound that is currently advanced, instead of unconditionally checking
the lower bound.
Closes #10531
We introduce a new service that performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrunk. To learn about liveness of endpoints, a user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.
Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger` is
responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.
Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.
The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.
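The behavior described above can be modelled in a few lines. This is a single-threaded sketch, not the actual service: sharding and abort handling are left out, and the names `LogicalClock`, `FailureDetector`, `Recorder` and `tick()` are hypothetical. The pinger is any callable taking an abstract endpoint ID, and time is measured by a logical clock, as in the test/simulation case.

```python
class LogicalClock:
    """Simulation clock: time only moves when advance() is called."""

    def __init__(self):
        self.now = 0

    def advance(self, ticks):
        self.now += ticks


class FailureDetector:
    def __init__(self, pinger, clock):
        self._pinger = pinger    # callable: endpoint_id -> bool (ping succeeded)
        self._clock = clock
        self._last_ok = {}       # endpoint_id -> time of last successful ping
        self._listeners = []     # (listener, threshold, endpoints seen alive)

    def add_endpoint(self, ep):
        self._last_ok.setdefault(ep, None)

    def remove_endpoint(self, ep):
        self._last_ok.pop(ep, None)
        for listener, _threshold, alive in self._listeners:
            if ep in alive:
                alive.discard(ep)
                listener.mark_dead(ep)  # final notification for this endpoint

    def register_listener(self, listener, threshold):
        self._listeners.append((listener, threshold, set()))

    def tick(self):
        """Ping every endpoint once and notify listeners about changes."""
        now = self._clock.now
        for ep in list(self._last_ok):
            responded = self._pinger(ep)
            if responded:
                self._last_ok[ep] = now
            for listener, threshold, alive in self._listeners:
                if responded and ep not in alive:
                    alive.add(ep)
                    listener.mark_alive(ep)  # a response marks alive at once
                elif (not responded and ep in alive
                      and now - self._last_ok[ep] >= threshold):
                    alive.discard(ep)
                    listener.mark_dead(ep)   # threshold passed since last ok


class Recorder:
    def __init__(self):
        self.events = []

    def mark_alive(self, ep):
        self.events.append(("alive", ep))

    def mark_dead(self, ep):
        self.events.append(("dead", ep))


up = {1}                                  # endpoints that answer pings
clock = LogicalClock()
fd = FailureDetector(pinger=lambda ep: ep in up, clock=clock)
rec = Recorder()
fd.register_listener(rec, threshold=3)
fd.add_endpoint(1)
fd.tick()                                 # t=0: ping succeeds
up.discard(1)
clock.advance(1); fd.tick()               # t=1: ping fails, below threshold
clock.advance(2); fd.tick()               # t=3: threshold reached
```

In this run the listener sees the endpoint marked alive on the first successful ping and marked dead only once three ticks have passed since that ping.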
We modify the randomized nemesis test to use the new service.
The service is sharded, but for simplicity of implementation in the test
we implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live. rpcs are using the existing simulated
network and clock using the existing logical timers.
We also integrate the service with production code. There,
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.
Production `clock` is a simple implementation which uses
`std::chrono::steady_clock` and `seastar::sleep_until` underneath.
Translating `steady_clock` durations to `direct_fd::clock` durations happens
by taking the number of ticks.
We connect the group 0 raft server rpc implementation to the new service,
so that when servers are added or removed from the group 0 configuration,
corresponding endpoints are added to the direct failure detector service.
Thus the set of detected endpoints will be equal to the group 0 configuration.
On each shard, we register a listener for the service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.
---
v6:
- remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). The diff is easy to read: f617aeca62..d4b225437c
v5:
- renamed `rpc` to `pinger`
- replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates`
- replaced `unsigned` with `shard_id`
- fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior)
- improve `_num_workers` assertions
- signal `_alive_start_changed` only when `_alive_start` indeed changed
- renamed `{_marked}_alive_start` to `{_marked}_alive_start_index`
v4:
- rearrange ping_fiber(). Remove the loop at the end of the big `while`
which was timing out listeners (after the sleep). Instead:
- rely on the loop before the sleep for timing out listeners
- before calling ping(), check if there is a timed out listener,
if so abandon the ping, immediately proceed to the timing-out-listeners
loop, and then immediately proceed to the next iteration of the big `while`
(without sleeping)
- inline send_mark_dead() and send_mark_alive(); each was used in
exactly one place after the rearrangement
- when marking alive, instead of repeatedly doing `--_alive_start` and
signalling the condition variable, just do `_alive_start = 0` and signal
the condition variable once
- fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was
`_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`.
Indeed, we want to wait for the stopping code (`destroy_worker()`)
to set `_alive_start = _fd._listeners.size()` before `notify_fiber()`
finishes so `notify_fiber()` can send the final `mark_dead`
notifications for this endpoint. There was a race before where
`notify_fiber()` could finish before it sent those notifications
(because it finished as soon as it noticed `_as.abort_requested()`)
- fix some waits in the unit test; they depended on particular ordering
of tasks by the Scylla reactor, the test could sometimes hang in debug
mode which randomizes task order
- fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an
exceptional discarded future in some cases
v3:
- fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard
- invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers
- add a unit test (second patch)
v2:
- rename `direct_fd` namespace to `direct_failure_detector`
- move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh}
- cleaned license comments
- removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead:
- _listeners is now a boost::container::flat_multimap (previously it was std::multimap)
- _alive_start is no longer an iterator to _listeners, but an index (size_t)
- _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed
- ping_fiber() signals _alive_start_changed when it changes _alive_start
- notify_fiber() waits on _alive_start_changed. When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start
- replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still)
- _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`)
- `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes
- same for `failure_detector::impl::remove_endpoint`
- `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen
- comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers)
- remove unnecessary `if (_as.abort_requested())` in `ping_fiber()`
- in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately.
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start."
- `register_listener` now takes a `listener&`, not `listener*`
- at `register_listener` comment why we allow different thresholds (second to last paragraph)
- at `register_listener` mention that listeners can be registered on any shard (last paragraph)
- add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`.
- replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution:
- the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once
- when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures.
- when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10*ping_period before retrying. Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc
- commented that `clock::sleep_until` should signalize aborts using `sleep_aborted`
- `clock::now()` is `noexcept`
- `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item
- in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item
- randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification)
- message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called)
- rebase
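The v2 change from a queue of endpoint updates to a map keyed by endpoint can be sketched as follows. This is a simplified, synchronous model: the condition variable and the retry sleep are omitted, and `EndpointUpdates` and `drain()` are hypothetical names.

```python
class EndpointUpdates:
    """Pending add/remove requests, at most one entry per endpoint."""

    def __init__(self):
        self._pending = {}  # endpoint_id -> "added" | "removed"

    def add_endpoint(self, ep):
        self._pending[ep] = "added"      # overwrites any older request

    def remove_endpoint(self, ep):
        self._pending[ep] = "removed"

    def drain(self, apply):
        """Apply pending updates; keep entries whose operation failed."""
        for ep, update in list(self._pending.items()):
            try:
                apply(ep, update)
            except Exception:
                continue                 # entry stays in the map for a retry
            if self._pending.get(ep) == update:
                del self._pending[ep]

    def __len__(self):
        return len(self._pending)


updates = EndpointUpdates()
updates.add_endpoint(1)
updates.remove_endpoint(1)
updates.add_endpoint(1)                  # still a single entry for endpoint 1
updates.add_endpoint(2)
pending_before = len(updates)

applied = []


def apply(ep, update):
    if ep == 2:
        raise MemoryError("simulated allocation failure")
    applied.append((ep, update))


updates.drain(apply)
```

The number of entries stays bounded by the number of distinct endpoints, and a failed operation leaves its entry in place so it can be retried later.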
Closes #10437
* github.com:scylladb/scylla:
service: raft: remove `raft_gossip_failure_detector`
service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
main: start direct failure detector service
messaging_service: abortable version of `send_gossip_echo`
message: abortable version of `send_message`
test: raft: randomized_nemesis_test: remove old failure_detector
test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector`
test: raft: randomized_nemesis_test: ping all shards on each tick
test: unit test for new failure detector service
direct_failure_detector: introduce new failure detector service
Clang and GDB don't see eye to eye on the template arguments of
std::variants. Executables generated by clang are known to yield 0
template arguments when one queries them via the GDB python API.
This patch adds a workaround to the std_variant wrapper, allowing a
caller who knows the type to brute-force getting the variant member with
the known type.
The attached volume doesn't need to be relabeled anymore (`:z` not
needed at the end of the volume attach instructions). This also allows
dropping the `sudo` from the invocation.
Dealing with the handful of tests that check range tombstones in
interesting ways and need more than search-and-replace.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
* Track the active range tombstone.
* Add `may_produce_tombstones()`.
* Flesh out `produces_row_with_key()`.
* Add more trace logs.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
This series refactors `table::snapshot` and moves the responsibility
to flush the table before taking the snapshot to the caller.
`flush_on_all` and `snapshot_on_all` helpers are added to replica::database
(by making it a peering_sharded_service) and upper layers,
including api and snapshot-ctl now call it instead of calling cf.snapshot directly.
With that, errors are handled in table::snapshot and propagated
back to the callers.
Failure to allocate the `snapshot_manager` object is fatal,
similar to failure to allocate a continuation, since we can't
coordinate across the shards without it.
Test: unit(dev), rest_api(debug)
Fixes #10500
Closes #10513
* github.com:scylladb/scylla:
table: snapshot: handle errors
table: snapshot: get rid of skip_flush param
database: truncate: skip flush when taking snapshot
test: rest_api: storage_service: verify_snapshot_details: add truncate
database: snapshot_on_all: flush before snapshot if needed
table: make snapshot method private
database: add snapshot_on_all
snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
api: storage_service: increase visibility of snapshot ops in the log
api: storage_service: coroutinize take_snapshot and del_snapshot
api: storage_service: take_snapshot: improve api help messages
test: rest_api: storage_service: add test_storage_service_snapshot
database: add flush_on_all variants
test: rest_api: add test_storage_service_flush
Turn table::snapshot into a coroutine,
catch exceptions, and return them to the caller.
Make sure that coordination across shards
would not break even if any of the shards hits
an error, by always signaling semaphores other
shards wait on.
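The "always signal" rule can be illustrated with threads standing in for shards. This is only a sketch of the idea under that assumption; `snapshot_all` and the job functions are hypothetical and unrelated to the real coroutine-based code.

```python
import threading


def snapshot_all(shard_jobs):
    """Run one job per 'shard' and wait for all of them, even on errors."""
    done = threading.Semaphore(0)
    errors = []

    def run(job):
        try:
            job()
        except Exception as exc:
            errors.append(exc)  # remember the error for the caller
        finally:
            done.release()      # always signal, so nobody waits forever

    threads = [threading.Thread(target=run, args=(job,)) for job in shard_jobs]
    for thread in threads:
        thread.start()
    for _ in shard_jobs:
        if not done.acquire(timeout=5):
            raise TimeoutError("a shard never signalled completion")
    for thread in threads:
        thread.join()
    if errors:
        raise errors[0]         # propagate the failure back to the caller


def ok():
    pass


def boom():
    raise RuntimeError("shard failed")


try:
    snapshot_all([ok, boom, ok])
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
```

Because the semaphore is released in a `finally` block, the coordinator finishes waiting and reports the error instead of hanging.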
All errors except for failing to allocate
the snapshot_manager objects are caught
and propagated back.
Failing to allocate the snapshot_manager is fatal
similar to failing to allocate a continuation
since we can't coordinate across the shards without it,
so abort if that fails.
Fixes #10500
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
database::truncate already flushes the table
on auto_snapshot so there is never a reason
to flush it again in table::snapshot.
Note that cf.can_flush() is false only if memtables
are empty so there is nothing to flush, or there is
no seal_immediate_fn and then table::snapshot
wouldn't be able to flush either.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
flush_on_all shards before taking the snapshot if !skip_flush
so we can get rid of flushing in table::snapshot.
Refs #10500
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And move the logic from snapshot-ctl down to the
replica::database layer.
A following patch will move the flush phase
from the replica::table::snapshot layer
out to the caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Detecting a secondary index by checking for a dot
in the table name is wrong as tables generated by Alternator
may contain a dot in their name.
Instead detect both materialized views and secondary indexes
using the schema()->is_view() method.
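A toy model shows why the name-based check misfires. The `Schema` class and both helper functions are hypothetical stand-ins for this example, not the real schema type.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    name: str
    view: bool = False

    def is_view(self):
        return self.view


def looks_like_index_by_name(schema):
    # Old heuristic: any dot in the table name is taken to mean "index".
    return "." in schema.name


def is_view_or_index(schema):
    # New check: ask the schema instead of guessing from the name.
    return schema.is_view()


dotted_base_table = Schema(name="orders.v2")  # a base table with a dotted name
view_table = Schema(name="orders_by_customer", view=True)
```

The dotted base table is wrongly flagged by the name heuristic, while the schema-based check classifies both tables correctly.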
Fixes #10526
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
snapshot operations over the api are rare
but they contain significant state on disk in the
form of sstables hard-linked to the snapshot directories.
Also, we've seen snapshot operations hang in the field,
requiring a core dump to analyse the issue,
while there were no records in the log indicating
when previous snapshot operations were last executed.
This change promotes logging to info level
when take_snapshot and del_snapshot start,
and logs errors in case they fail.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test the snapshot operations via the rest api.
Added test/rest_api/rest_util.py with
new_test_snapshot that creates a new test snapshot
and automagically deletes it when the `with` block
is exited, similar to new_test_keyspace and new_test_table.
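The shape of such a helper can be sketched with `contextlib`. This is not the code from test/rest_api/rest_util.py: the `FakeSnapshotApi` class is a hypothetical in-memory stand-in for the REST client, and the tag naming is invented for the example.

```python
import itertools
from contextlib import contextmanager

_snapshot_ids = itertools.count()


class FakeSnapshotApi:
    """In-memory stand-in for the REST API, for illustration only."""

    def __init__(self):
        self.snapshots = set()

    def take(self, tag):
        self.snapshots.add(tag)

    def delete(self, tag):
        self.snapshots.discard(tag)


@contextmanager
def new_test_snapshot(api):
    tag = f"test_snapshot_{next(_snapshot_ids)}"  # unique tag per call
    api.take(tag)
    try:
        yield tag
    finally:
        api.delete(tag)  # runs even if the body of the `with` block raises


api = FakeSnapshotApi()
with new_test_snapshot(api) as tag:
    present_inside = tag in api.snapshots
present_after = tag in api.snapshots
```

Putting the deletion in `finally` means a failing test body still cleans up its snapshot.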
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allowing bind markers in collection literals is a change which causes minor differences in behavior between Scylla and Cassandra. Despite such an undesirable effect, I think allowing them is a good idea because it makes [refactoring work made by cvybhu](https://github.com/scylladb/scylla/pull/10409) easier - 469d03f8c2.
Also, making Scylla accept a superset of valid Cassandra cql expressions does not make us less compatible (maybe apart from test suite compatibility).
Closes #10457
* github.com:scylladb/scylla:
test/boost: cql_query_test: allow bound variables in test_list_of_tuples_with_bound_var
test/boost: cql_query_test: test bound variables in collection literals
cql3: expr: do not allow unset values inside collections
cql3: expr: prepare_expr: allow bind markers in collection literals
"
There's a cql_type_parser::parse() method that needs to get user
types for a keyspace by its name. For this it uses the global
storage proxy instance as a place to get database from. This set
introduces an abstract user_types_storage helper object that's
responsible in providing the user types for the caller.
This helper, in turn, is provided to the parse() method by the
database itself or by the schema_ctxt object that needs parse()
to unfreeze schemas and doesn't have database at those times.
This removes one more get_storage_proxy() call.
"
* 'br-user-types-storage' of https://github.com/xemul/scylla:
cql_type_parser: Require user_types_storage& in parse()
schame_tables: Add db/ctxt args here and there
user_types: Carry storage on database and schema_ctxt
data_dictionary: Introduce user types storage
When pull_github_pr.sh uses git cherry-pick to merge a single-patch
pull request, this cherry-pick can fail. A typical example is trying
to merge a patch that has actually already been merged in the past,
so cherry-pick reports that the patch, after conflict resolution,
is empty.
When cherry-pick fails, it leaves the working directory in an annoying
mid-cherry-pick state, and today the user needs to manually call
"git cherry-pick --abort" to return to the normal state. The script
should do it automatically - so this is what we do in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When repair_meta stops it does so in the background and reports back
a shared future into whose shared promise peer it resolves that
background activity. There's a shorter way to forward a future result
into another, even shared, promise. And this method doesn't need to
discard a future.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/253
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lcs at those places are explicitly start()ed beforehand. The
is_start() check is necessary when using the latency_counter with a
histogram that may or may not start the counter (this is the case
in several class table methods).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
propagate_replacement() is an internal function that shouldn't be in
the public interface. No one besides a unit test for incremental
compaction needs it. In the future, I want to revisit incremental
compaction unit test to stop using it and only rely on public
interfaces.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220506171647.81063-1-raphaelsc@scylladb.com>
On each shard, we register a listener for the new direct failure detector service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.
There is some complexity in between, because we need to translate
direct_failure_detector endpoint_ids to inet_addresses and raft::server_ids
to inet_addresses, but all building blocks are already there.
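The listener and the two translations can be sketched like this. It is a simplified model: plain dictionaries stand in for the existing address lookups, and `RaftLiveness` with its constructor arguments is a hypothetical name.

```python
class RaftLiveness:
    """Listener that tracks live addresses and answers is_alive() queries."""

    def __init__(self, endpoint_to_addr, server_to_addr):
        self._endpoint_to_addr = endpoint_to_addr  # failure detector ID -> address
        self._server_to_addr = server_to_addr      # raft server ID -> address
        self._alive_addrs = set()

    def mark_alive(self, endpoint_id):
        self._alive_addrs.add(self._endpoint_to_addr[endpoint_id])

    def mark_dead(self, endpoint_id):
        self._alive_addrs.discard(self._endpoint_to_addr[endpoint_id])

    def is_alive(self, server_id):
        # Translate the raft ID to an address, then check set membership.
        addr = self._server_to_addr.get(server_id)
        return addr is not None and addr in self._alive_addrs


liveness = RaftLiveness(
    endpoint_to_addr={7: "10.0.0.1"},
    server_to_addr={"srv-a": "10.0.0.1", "srv-b": "10.0.0.2"},
)
liveness.mark_alive(7)
```

After `mark_alive(7)`, only the server whose address matches endpoint 7 is reported alive.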
We connect the group 0 raft server rpc implementation to the new direct
failure detector service, so that when servers are added or removed from
the group 0 configuration, corresponding endpoints are added to the
direct failure detector service. Thus the set of detected endpoints will
be equal to the group 0 configuration.
This causes the failure detector service to start pinging endpoints,
but no listeners are registered yet. The following commit changes that.
We add the new direct failure detector to the list of services started
in the Scylla process.
To start the service, we need an implementation of `pinger` and `clock`.
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.
`clock` is a simple implementation which uses `std::chrono::steady_clock`
and `seastar::sleep_until` underneath. Translating `steady_clock`
durations to `direct_failure_detector::clock` durations happens by taking
the number of ticks.
The service is currently not used, just initialized; no endpoints are
added and no listeners are registered yet, but the following commits
change that.
Use the new `send_message_abortable` function to implement an abortable
version of `send_gossip_echo`.
These echo messages will be used for direct failure detection.
I want to be able to timeout `send_message`, but not through the
existing `send_message_timeout` API which forces me to use a particular
clock/duration/timepoint type. Introduce a more general
`send_message_abortable` API which gets an `abort_source&`, subscribes
to it, and uses the `rpc::cancellable` interface to cancel the RPC on
abort.
The function is 90% copy-pasta from `send_message{_timeout}`, only the
abort part is new.
Until now the nemesis test used its own failure detector implementation
which used one-way heartbeats.
Switch it to use the new direct failure detection service, which will
also be used in production code. Integrating it does require some work
however as we need to implement the `pinger` and `clock` interfaces
for the failure detector.
The service is sharded, but for simplicity of implementation we
implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live.
Right now the test is running entirely on shard 0, but we want to
introduce a sharded service to the test. The initial naive attempt of
doing that failed because the test would time out (reach the tick limit)
before any work distributed to other shards could even start. The
solution in this commit solves that by synchronizing the shards on each
tick.
When the test is run with smp=1, the behavior is as before.
The new service performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrunk. To learn about liveness of endpoints, a user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.
Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger`
is responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.
Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.
The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.
Endpoints can be added or removed only through the shard 0 instance of
the service and shard 0 is responsible for coordinating the endpoint
workers. Listeners can be registered on any shard.
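The placement rule (pick the shard with the fewest workers, coordinated from one place) can be sketched as below. `Coordinator` and `pick_shard` are hypothetical names, and breaking ties by the lowest shard index is an assumption of this example.

```python
def pick_shard(num_workers):
    """Return the shard with the fewest workers (lowest index on ties)."""
    return min(range(len(num_workers)), key=lambda shard: num_workers[shard])


class Coordinator:
    """Models the shard 0 instance that assigns endpoint workers to shards."""

    def __init__(self, shard_count):
        self.num_workers = [0] * shard_count  # per-shard worker counts
        self.placement = {}                   # endpoint_id -> shard

    def add_endpoint(self, ep):
        if ep in self.placement:
            return
        shard = pick_shard(self.num_workers)
        self.placement[ep] = shard
        self.num_workers[shard] += 1

    def remove_endpoint(self, ep):
        shard = self.placement.pop(ep, None)
        if shard is not None:
            self.num_workers[shard] -= 1


coord = Coordinator(shard_count=3)
for endpoint in (1, 2, 3, 4):
    coord.add_endpoint(endpoint)
coord.remove_endpoint(2)   # frees a slot on shard 1
coord.add_endpoint(5)      # lands on the now least-loaded shard
```

Keeping the per-shard worker count up to date is what makes this selection cheap.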
"
There's an endpoint->state map member of the gossiper class. The first
ugly thing about it is that the member is public.
Next, there's a whole bunch of helpers around that map that export
various bits of information from it. All of those helpers reshard
to shard-0 to read from the state map, ignoring the fact that the
map is replicated on all shards internally. Also, some of those
helpers effectively duplicate each other for no real gain. Finally,
most of them are specific to api/ code, and open-coding them often
makes api/ handlers shorter and simpler.
This set removes the unused, api-only or trivial state map accessors
and marks the state map itself private (underscore prefix included).
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/233/
"
* 'br-gossiper-sanitize-api-2' of https://github.com/xemul/scylla:
gossiper: Add underscores to new private members
code: Indentation fix after previous patch
gossiper, code: Relax get_up/down/all_counters() helpers
api: Fix indentation after previous patch
gossiper, api: Remove get_arrival_samples()
gossiper, api: Remove get/set phi convict threshold helpers
gossiper, api: Move get_simple_states() into API code
gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload
gossiper, api: Remove get_endpoint_state() helpers
gossiper: Make state and locks maps private
gossiper: Remove dead code
Attempting to call advance_to() on the index, after it is positioned at EOF, can result in an assert failure, because the operation results in an attempt to move backwards in the index-file (to read the last index page, which was already read). This only happens if the index cache entry belonging to the last index page is evicted, otherwise the advance operation just looks-up said entry and returns it. To prevent this, we add an early return conditioned on eof() to all the partition-level advance-to methods.
A regression unit test reproducing the above described crash is also added.
Fixes: #10403
Closes #10491
* github.com:scylladb/scylla:
sstables/index_reader: short-circuit fast-forward-to when at EOF
test/lib/random_schema: add a simpler overload for fixed partition count
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.
Fixes #10487
Closes #10503
This series fixes a few issues on the table truncate path:
- "memtable_list: safely futurize clear_and_add"
- reinstates an async version of table::clear_and_add, just safe against #10421
- a unit test reproducing #10421 was added to make sure the new version is indeed safe.
- "table: clear: serialize with ongoing flush" fixes #10423
- a unit test reproducing #10423 was added
Fixes #10281
Fixes #10423
Test: unit(dev), database_test. test_truncate_without_snapshot_during_{writes,flushes} (debug)
Closes #10424
* github.com:scylladb/scylla:
test: database_test: add test_truncate_without_snapshot_during_writes
memtable_list: safely futurize clear_and_add
table: clear: serialize with ongoing flush
primitive_consumer::read_bytes() destroys and creates a vector for every value it reads.
This happens for every cell.
We can save a bit of work by reusing the vector.
Closes #10512
* github.com:scylladb/scylla:
sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()
utils: fragmented_temporary_buffer: add release()
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup
Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().
_last_compacted_keys is used by regular compaction for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.
Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.
To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.
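
The guard described above can be sketched in Python, with `None` playing the role of a disengaged optional. The class name `LeveledState` and the boolean return value are assumptions made for this illustration only.

```python
from typing import Optional


class LeveledState:
    """Hypothetical stand-in for the strategy state discussed above."""

    def __init__(self) -> None:
        # Filled in lazily by regular compaction; None models a disengaged optional.
        self.last_compacted_keys: Optional[list] = None

    def notify_completion(self, level: int, last_key: str) -> bool:
        """Record the last compacted key; return False if the update was skipped."""
        if self.last_compacted_keys is None:
            # Regular compaction has not run yet, so there is nothing to update.
            return False
        self.last_compacted_keys[level] = last_key
        return True


state = LeveledState()
skipped = state.notify_completion(1, "key-a")  # maintenance compaction finishes first
```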
Fixes #10378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10508
SSTable was moved into descriptor, so on failure, it couldn't be used
without resulting in a segfault. Fix it by not moving sst, and changing
signature to make it explicit we don't want to move the content.
Fixes #10505.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10506
These helpers count elements in the endpoint state map. It makes sense
to keep them in gossiper API, but it's worth removing the wrappers that
do invoke_on(0). This makes code shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API method in question just tries to scan the state map. There's no
need in doing invoke_on(0) and in a separate helper method in gossiper,
the creation of the json return value can happen in the API handler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method helps update the endpoint state in handle_major_state_change by
returning a copy of an endpoint state that's kept while the map's entry
is being replaced with the new state. It can be replaced with a shorter
code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them -- one to do invoke_on(0) the other one to get the
needed data. The former one is not needed -- the scanned endpoint state
map is replicated across shards and is the same everywhere. The latter
is not needed, because there's only one user of it -- the API -- which
can work with the existing gossiper API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Locks are not needed outside gossiper; the state map is sometimes read from,
but there is a const getter for such cases. Both methods now deserve the
underbar prefix, but it doesn't come with this short patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series futurizes two synchronous functions used for data reconciliation:
`data_read_resolver::resolve` and `to_data_query_result` and does so
by introducing lower-level asynchronous infrastructure:
`mutation_partition_view::accept_gently`,
`frozen_mutation::unfreeze_gently` and `frozen_mutation::consume_gently`,
and `mutation::consume_gently`.
This trades some cycles on this cold path to prevent known reactor stalls.
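
The trade-off can be illustrated with Python's `asyncio` standing in for the reactor. This is only an analogy: `consume_gently` below is a hypothetical helper written for this example, not the Seastar-based code the series adds.

```python
import asyncio


async def consume_gently(fragments, consumer, yield_every=1):
    """Feed fragments to `consumer`, yielding to the event loop periodically."""
    for i, fragment in enumerate(fragments, start=1):
        consumer(fragment)
        if i % yield_every == 0:
            # Give other tasks a chance to run instead of monopolizing the loop.
            await asyncio.sleep(0)


async def demo():
    order = []

    async def other_task():
        for n in range(3):
            order.append(f"other-{n}")
            await asyncio.sleep(0)

    task = asyncio.create_task(other_task())
    await consume_gently(range(3), lambda f: order.append(f"frag-{f}"))
    await task
    return order


interleaved = asyncio.run(demo())
```

Each yield costs a trip through the scheduler, which is the "some cycles" being traded; in return, the other task makes progress before the long-running consumer finishes.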
Fixes #2361
Fixes #10038
Closes #10482
* github.com:scylladb/scylla:
mutation: add consume_gently
frozen_mutation: add consume_gently
query: coroutinize to_data_query_result
frozen_mutation: add unfreeze_gently
mutation_partition_view: add accept_gently methods
storage_proxy: futurize data_read_resolver::resolve
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
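
A toy model of the failure and the fix, written in Python. It is not the sstable index reader: the class, the `guard_eof` switch, and the assumption that the current page is released on reaching EOF are all invented here to mimic the evicted-cache case.

```python
import bisect


class IndexReader:
    """Toy forward-only index reader; all names here are hypothetical."""

    def __init__(self, pages, guard_eof=True):
        self._pages = pages                        # each page: sorted list of keys
        self._first_keys = [p[0] for p in pages]   # summary: first key of each page
        self._next_unread = 0                      # forward-only file position
        self._current_index = None
        self._current_page = None
        self._eof = False
        self._guard_eof = guard_eof

    def eof(self):
        return self._eof

    def _read_page(self, n):
        # No page cache in this toy, which mimics the evicted-entry case.
        assert n >= self._next_unread, "attempt to move backwards in the index file"
        self._next_unread = n + 1
        return self._pages[n]

    def advance_to(self, key):
        if self._guard_eof and self.eof():
            return  # the fix: nothing to do once positioned at EOF
        n = max(bisect.bisect_right(self._first_keys, key) - 1, 0)
        if n != self._current_index:
            self._current_page = self._read_page(n)
            self._current_index = n
        if n == len(self._pages) - 1 and key > self._current_page[-1]:
            self._eof = True
            self._current_index = None  # assumption: page is released at EOF


reader = IndexReader([[1, 2], [5, 6]])
reader.advance_to(10)  # runs past the last key, so the reader is now at EOF
reader.advance_to(11)  # returns immediately thanks to the early return
```

Constructing the reader with `guard_eof=False` and repeating the two calls trips the assertion in `_read_page`, which is the shape of the crash described above.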
Currently, adding a cluster feature requires editing several files and
repeating the new feature name several times. This series reduces
the boilerplate to a single line (for non-experimental features), and
perhaps three for experimental features.
Closes #10488
* github.com:scylladb/scylla:
gms: feature_service: remove variable/helper function duplication
gms: feature: make `operator bool` implicit
gms: feature_service: remove feature variable duplication in enable()
gms: feature_service: remove feature variable declaration/definition duplication
gms: features: de-quadruplicate active feature names
gms: features: de-quadruplicate deprecated feature names
gms: feature_service: avoid duplicating feature names when listing known features
Reduce stalls by maybe yielding in-between partitions,
and by awaiting unfreeze_gently where possible.
Refs #10038
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow yielding when consuming mutation_partition_view.
To be used in later patches by a new unfreeze_gently function
and frozen_mutation::consume.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow yielding in data_read_resolver::resolve to
prevent reactor stalls.
TODO: unfreeze_gently, to prevent stalls due
to large partitions.
Refs #2361
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Right now to get user types the method in question gets global proxy
instance to get database from it and then peek a keyspace, its metadata
and, finally, the user types. There's also a safety check for proxy not
being initialized, which happens in tests.
Instead of messing with the proxy, the parse() method now accepts the
user_types_storage reference from which it gets the types. All the
callers already have the needed storage at hand -- in most of the cases
it's one shared between the database and schema_ctxt. In case of tests
it's a dummy storage, in case of schema-loader it's its local one.
The get_column_mapping() is special -- it doesn't expect any user-types
to be parsed and passes "" keyspace into it, nor does it have db/ctxt to
get types storage from, so it can safely use the dummy one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to have them in places that call cql_type_parser::parse.
Pure churn reduction for the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user types storage is needed in cql_type_parser::parse which is in
turn called with either replica::database or schema_ctxt at hand.
To facilitate the former case replica::database has its own user types
storage created in database constructor.
The latter case is a bit trickier. In many cases the ctxt is created as
a temporary object and the database is available at those places. Also
the ctxt object lives on the schema_registry instance which doesn't have
database nearby. However, that ctxt lifetime is the same as the registry
instance one and when it's created there's a database at hand (it's the
database constructor that calls schema_registry.init() passing "this"
into it). Thus, the solution is to make database's user types storage be
a shared pointer that's shared between database itself and all the ctxts
out there including the one that lives on schema_registry instance.
When database goes away it .deactivate()s its user types storage so that
any ctxts that may share it stay on the safe side and don't use database
after free. This part will go away when the schema_registry will be
deglobalized.
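
The sharing and deactivation scheme can be sketched as follows. This is a Python analogy with hypothetical class names; in particular, what a deactivated storage does on lookup is an assumption (this sketch raises an error).

```python
class UserTypesStorage:
    """Hypothetical shared storage that its owner can deactivate."""

    def __init__(self, types_by_keyspace):
        self._types = types_by_keyspace
        self._active = True

    def deactivate(self):
        # Called by the owner on shutdown; drops the reference to its data.
        self._types = None
        self._active = False

    def get(self, keyspace, name):
        if not self._active:
            raise RuntimeError("user types storage is deactivated")
        return self._types[keyspace][name]


class Database:
    def __init__(self):
        self.keyspaces = {"ks": {"address": ("street", "city")}}
        # One storage object, handed out to every context that needs it.
        self.user_types = UserTypesStorage(self.keyspaces)

    def shutdown(self):
        self.user_types.deactivate()


class SchemaCtxt:
    def __init__(self, storage):
        self.user_types = storage  # shared with the database, not copied


db = Database()
ctxt = SchemaCtxt(db.user_types)
found = ctxt.user_types.get("ks", "address")
```

After `db.shutdown()`, a lookup through `ctxt` fails loudly instead of touching data owned by a database that no longer exists.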
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The interface in question will be used by cql type parser to get user
types. There are already three possible implementations of it:
- dummy, when no user types are in use (e.g. tests)
- schema-loader one, which gets user types from keyspaces that are
collected on its implementation of the database
- replica::database one, which does the same, but uses the real
database instance and that will be shared between schema_ctxts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.
To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).
References throughout the codebase are adjusted.
Features are usually used as booleans, so allowing them
to implicitly decay to bool is not a mistake. In fact a bunch
of helper functions exist to cast feature variables to bool.
Prepare to reduce this boilerplate by allowing automatic conversion
to bool.
Active feature names are present four or five times in the code:
a declaration in feature.hh, a definition and initialization (two copies)
in feature_service.cc, a use in feature_service.cc, and a possible
reference in feature_service.cc if the feature is conditionally enabled.
Switch to just one copy or two, using the "foo"sv operator (and "foo"s)
to generate a string_view (string) as before.
Note that a few features had different external and C++ names; we
preserve the external name.
This patch does cause literal strings to be present in two places,
making them vulnerable to misspellings. But since feature names
are immutable, there is little risk that one will change without
the other.
Deprecated features are unused, but are present four times in the code:
a declaration in feature.hh, a definition and initialization (two copies)
in feature_service.cc, and a use in feature_service.cc. Switch to just
one copy, using the "foo"sv operator to generate a string_view as before.
Note that a few features had different external and C++ names; we
preserve the external name.
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.
Should probably be backported to all CDC enabled versions.
Fixes #10473
Closes #10474
When updating an updateable value via CQL the new value comes as a
string that's then boost::lexical_cast-ed to the desired value. If the
cast throws the respective exception is printed in logs which is very
likely uncalled for.
fixes: #10394
tests: manual
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220503142942.8145-1-xemul@scylladb.com>
"
- Alternator gets gossiper for its proxy dependency
- Forward service method that takes global gossiper can re-use
proxy method (forward -> proxy reference is already there)
- Table code is patched to require gossiper argument
- Snitch gets a dependency reference on snitch_ptr and some extra
care for snitch driver vs snitch-ptr interaction and gossip test
- Cql test env should carry gossiper reference on-board
- Few places can re-use the existing local gossiper reference
- Scylla-gdb needs to get gossiper from debug namespace and needs
_not_ to get feature service from gossiper
"
* 'br-gossiper-deglobal-2' of https://github.com/xemul/scylla:
code: De-globalize gossiper
scylla-gdb, main: Get feature service without gossiper help
test: Use cql-test-env gossiper
cql test env: Keep gossiper reference on board
code: Use gossiper reference where possible
snitch: Use local gossiper in drivers
snitch: Keep gossiper reference
test: Remove snitch from manual gossip test
gossiper: Use container() instead of the global pointer
main, cql_test_env: Start snitch later
snitch: Move snitch_base::get_endpoint_info()
forward service: Re-use proxy's helper with duplicated code
table: Don't use global gossiper
alternator: Don't use global gossiper
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectTest.java into our cql-pytest framework.
This large test file includes 78 tests for various types of SELECT
operations. Four additional tests require UDF in Java syntax,
and were skipped.
All 78 tests pass on Cassandra. 25 of the tests fail on Scylla
reproducing 3 already known Scylla issues and 8 previously-unknown
issues:
Previously known issues:
Refs #2962: Collection column indexing
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #8627: Cleanly reject updates with indexed values where
value > 64k
Newly-discovered issues:
Refs #10354: SELECT DISTINCT should allow filter on static columns,
not just partition keys
Refs #10357: Spurious static row returned from query with filtering,
despite not matching filter
Refs #10358: Comparison with UNSET_VALUE should produce an error
Refs #10359: "CONTAINS NULL" and "CONTAINS KEY NULL" restrictions
should match nothing
Refs #10361: Null or UNSET_VALUE subscript should generate an
invalid request error
Refs #10366: Enforce Key-length limits during SELECT
Refs #10443: SELECT with IN and ORDER BY orders rows per partition
instead of for the entire response
Refs #10448: The CQL token() function should validate its parameters
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10449
No code uses global gossiper instance, it can be removed. The main and
cql-test-env code now have their own real local instances.
This change also requires adding the debug:: pointer and fixing the
scylla-gdb.py to find the correct global location.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is needed not to mess with removed global gossiper in the next
patch. Other than this, it's better to access services by their own
debug:: pointers, not via under-the-hood dependency chains.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's yet another -test-env -- the alternator- one -- which needs
gossiper. It now uses global reference, but can grab gossiper reference
from the cql-test-env which participates in initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference is already available at the env initialization, but it's
not kept on the env instance itself. Will be used by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some places in the code have a function-local gossiper reference but
continue to use global instance. Re-use the local reference (it's going
to become sharded<> instance soon).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Each driver has a pointer to this shard's snitch_ptr which, in turn, has
a reference to gossiper. This lets drivers stop using the global
gossiper instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference is put on the snitch_ptr because this is the sharded<>
thing and because gossiper reference is the same for different snitch
drivers. Also, getting gossiper from snitch_ptr by driver will look
simpler than getting it from any base class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Snitch depends on gossiper and system keyspace, so it needs to be
started after those two do.
fixes #10402
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_live_endpoints matches the same method on the proxy side. Since
the forward service carries proxy reference, it can use its method
(which needs to be made public for that sake).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The table::get_hit_rate needs gossiper to get hitrates state from.
There's no way to carry gossiper reference on the table itself, so it's
up to the callers of that method to provide it. Fortunately, there's
only one caller -- the proxy -- but the call chain to carry the
reference is not very short ... oh, well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Today, both operations are picking the highest level as the ideal level for
placing the output, but the size of input should be used instead.
The formula for calculating the ideal level is:
ceil(log base(fan_out) of (total_input_size / max_fragment_size))
where fan_out = 10 by default,
total_input_size = total size of input data and
max_fragment_size = maximum size for fragment (160M by default)
such that 20 fragments will be placed at level 2, as level 1
capacity is 10 fragments only.
By placing the output in the incorrect level, tons of backlog will be generated
for LCS because it will either have to promote or demote fragments until the
levels are properly balanced.
"
* 'optimize_lcs_major_and_reshape/v2' of https://github.com/raphaelsc/scylla:
compaction: LCS: avoid needless work post major compaction completion
compaction: LCS: avoid needless work post reshape completion
compaction: LCS: extract calculation of ideal level for input
compaction: LCS: Fix off-by-one in formula used to calculate ideal level
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.
This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.
New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.
Fixes #10440.
Closes #10456
* github.com:scylladb/scylla:
loading_cache_test: test prepared stmts cache
loading_cache: force minimum size of unprivileged
loading_cache: extract dropping entries to lambdas
loading_cache: separately track size of sections
loading_cache: fix typo in 'privileged'
This series gets rid of the remaining usage of flat_mutation_reader v1 in compaction
Test: sstable_compaction_test
Closes #10454
* github.com:scylladb/scylla:
compaction: sanitize headers from flat_mutation_reader v1
flat_mutation_reader: get rid of class filter
compaction: cleanup_compaction: make_partition_filter: return flat_mutation_reader_v2::filter
There were some doubts about which internal prepared statements are cached and which aren't.
In addition some queries that should have been cached (IMO), weren't. This PR adds some verbosity
to the caching enabling parameter as well as adding caching to some queries.
As a followup I would suggest to have internal queries as a compile time strings that have a compile time hash, this
will make the cache lookup not be dependent on the query textual length as it is today, this makes sense given that the
queries are static even today.
Closes #10465
Fixes #10335.
* github.com:scylladb/scylla:
internal queries: add caching to some queries
query_processor: remove default internal query caching behavior
query_processor: make execute_internal caching parameter more verbose
Some of the internal queries didn't have caching enabled even though
there are chances of the query executing in large bursts or relatively
often. An example of the former is `default_authorized::authorize` and of
the latter is `system_distributed_keyspace::get_service_levels`.
Fixes #10335
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Also, pass `node_ops_cmd` by value to get rid of lifetime issues
when converting to coroutine.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When executing internal queries, it is important that the developer
will decide if to cache the query internally or not since internal
queries are cached indefinitely. Also important is that the programmer
will be aware if caching is going to happen or not.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
`execute_internal` has a parameter to indicate if caching a prepared
statement is needed for a specific call. However this parameter was a
boolean so it was easy to miss its meaning in the various call sites.
This replaces the parameter type with a more verbose one so it is clear
from the call site what decision was made.
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.
Fixes: #10450
Closes #10451
* github.com:scylladb/scylla:
replica/database: drop_column_family(): drop querier cache entries after waiting for ops
replica/database: finish coroutinizing drop_column_family()
replica/database: make remove(const column_family&) private
Fix hangs on Scylla node startup with Raft enabled that were caused by:
- a deadlock when enabling the USES_RAFT feature,
- a non-voter server forgetting who the leader is and not being able to forward a `modify_config` entry to become a voter.
Read the commit messages for details.
Fixes: #10379
Refs: #10355
Closes #10380
* github.com:scylladb/scylla:
raft: actively search for a leader if it is not known for a tick duration
raft: server: return immediately from `wait_for_leader` if leader is known
service: raft: don't support/advertise USES_RAFT feature
Add a new test that sets up a CQL environment with a very small prepared
statements cache. The test reproduces a scenario described in #10440,
where a privileged section of prepared statement cache gets large
and that could possibly starve the unprivileged section, making it
impossible to execute BATCH statements. Additionally, at the end of the
test, prepared statements/"simulated batches" with prepared statements
are executed a random number of times, stressing the cache.
To create a CQL environment with small prepared cache, cql_test_config
is extended to allow setting custom memory_config value.
This patch enforces a minimum size of unprivileged section when
performing shrink() operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however if this section is already small
(smaller than max_size / 2), we will drop entries from the privileged
section.
For example if the cache could store at most 50 entries and there are 49
entries in privileged section, after adding 5 entries (that would go to
unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
New tests are added to check this behavior and bookkeeping of section
sizes.
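
The eviction rule and the 50-entry example above can be modelled with a small Python sketch. It assumes every entry has size 1 and uses a hypothetical `TwoSectionCache` class; the real cache tracks byte sizes and has more moving parts.

```python
from collections import deque


class TwoSectionCache:
    """Toy model of a cache with privileged and unprivileged sections."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.privileged = deque()    # oldest entry on the left
        self.unprivileged = deque()

    def size(self):
        return len(self.privileged) + len(self.unprivileged)

    def shrink(self):
        while self.size() > self.max_size:
            if len(self.unprivileged) >= self.max_size / 2:
                # Unprivileged section is large enough: evict from it first.
                self.unprivileged.popleft()
            else:
                # Section is already small: protect it, evict privileged instead.
                self.privileged.popleft()

    def add_unprivileged(self, key):
        self.unprivileged.append(key)
        self.shrink()


cache = TwoSectionCache(max_size=50)
cache.privileged.extend(f"p{i}" for i in range(49))
for i in range(5):
    cache.add_unprivileged(f"u{i}")
```

With the minimum-size rule, all five new entries survive and four privileged entries are evicted, so a batch that needs its statements cached together can proceed.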
Fixes #10440.
This patch splits _current_size variable, which tracked the overall size
of cache, to two variables: _unprivileged_section_size and
_privileged_section_size.
Their sum is equal to the old _current_size, but now you can get the
size of each section separately.
lru_entry's cache_size() is replaced with owning_section_size() which
references in which counter the size of lru_entry is currently stored.
That's done by picking the ideal level for the input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by picking the ideal level for reshape input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
ideal level is calculated as:
ceil(log base10 of ((input_size + max_fragment_size - 1) / max_fragment_size))
such that 20 fragments will be placed at level 2, as level 1
capacity is 10 fragments only.
The goal of extracting it is that the formula will be useful for
major in addition to reshape.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To calculate ideal level, we use the formula:
log 10 (input_size / max_fragment_size)
input_size / max_fragment_size is calculating number of fragments.
the problem is that the calculation can miss the last fragment, so
wrong level may be picked if last fragment would cause the target
level to exceed its capacity.
To fix it, let's tweak the formula to:
log 10 ((input_size + max_fragment_size - 1) / max_fragment_size)
such that the actual # of fragments will be calculated.
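
A small worked example of the off-by-one, as a Python sketch. The function names are made up for illustration, the 160 MB fragment size and fan-out of 10 are the defaults mentioned elsewhere in this series, and the level is computed as the smallest level whose capacity holds the fragments (an integer form of a rounded-up logarithm).

```python
def fragment_count(input_size: int, max_fragment_size: int, round_up: bool = True) -> int:
    """Number of fragments needed to hold `input_size` bytes."""
    if round_up:
        return (input_size + max_fragment_size - 1) // max_fragment_size
    return input_size // max_fragment_size  # old formula: may miss the last fragment


def ideal_level(fragments: int, fan_out: int = 10) -> int:
    """Smallest level whose capacity (fan_out ** level) holds `fragments`.

    Uses integer arithmetic to avoid floating-point rounding in the logarithm.
    """
    level, capacity = 0, 1
    while capacity < fragments:
        capacity *= fan_out
        level += 1
    return level


MB = 1024 * 1024
max_fragment_size = 160 * MB
input_size = 10 * max_fragment_size + 1  # ten full fragments plus one byte

old = ideal_level(fragment_count(input_size, max_fragment_size, round_up=False))
new = ideal_level(fragment_count(input_size, max_fragment_size))
```

Here the old formula counts 10 fragments and picks level 1, which is then over capacity by one fragment, while the corrected formula counts 11 and picks level 2.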
If the wrong level is picked, it can cause unnecessary write amplification,
as LCS will later have to promote data into the next level.
Problem spotted by Benny Halevy.
Fixes #10458.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Semantic of unset values inside collections is undefined.
Previous behavior of transforming list with unset value into unset value
was removed, because I couldn't find a reason for its existence.
"
After the recent conversion of the row-cache, two v1 mutation sources
remained: the memtable and the kl sstable reader.
This series converts both to a native v2 implementation. The conversion
is shallow: both continue to read and process the underlying (v1) data
in v1, the fragments are converted to v2 right before being pushed to
the reader's buffer. This conversion is simple, surgical and low-risk.
It is also better than the upgrade_to_v2() used previously.
Following this, the remaining v1 reader implementations are removed,
with the exception of the downgrade_to_v1(), which is the only one left
at this point. Removing this requires converting all mutation sinks to
accept a v2 stream.
upgrade_to_v2() is now not used in any production code. It is still
needed to properly test downgrade_to_v1() (which is still used), so we
can't remove it yet. Instead it is hidden as a private method of
mutation_source. This still allows for the above mentioned testing to
continue, while preventing anyone from being tempted to introduce new
usage.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191
"
* 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla:
readers: make upgrade_to_v2() private
test/lib/mutation_source_test: remove upgrade_to_v2 tests
readers: remove v1 forwardable reader
readers: remove v1 empty_reader
readers: remove v1 delegating_reader
sstables/kl: make reader impl v2 native
sstables/kl: return v2 reader from factory methods
sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
partition_snapshot_reader: convert implementation to native v2
mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()
flat_mutation_reader make_scrubbing_reader no longer exists
and there is no need to include flat_mutation_reader.hh
nor forward declare the class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We filter only on the partition key, so it doesn't matter,
but we want to get rid of flat_mutation_reader v1.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
The temporary TOC content is written upfront, to allow us to delete
all partial components using that content if the write fails.
After commit e5fc4b6, a partial sst cannot be deleted because the code
incorrectly assumes that all files being deleted have a
TOC, but that's not true for partial files that need to be removed.
The consequence of this is that space of partial files cannot be
reclaimed, making it harder for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of succeeding given the free space.
Let's fix this by taking into account temp TOC for partial files.
Fixes #10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #10411
* github.com:scylladb/scylla:
sstables: Fix deletion of partial SSTables
sstables: Fix fsync_directory()
sstables: Rename dirname() to a more descriptive name
Slice restrictions on the "duration" type are not allowed, and also if
we have a collection, tuple or UDT of durations. We made an effort to
print helpful messages on the specific case encountered, such as "Slice
restrictions are not supported on UDTs containing duration".
But the if()s were reversed, meaning that a UDT - which is also a tuple -
will be reported as a tuple instead of UDT as we intended (and as Cassandra
reports it).
The wrong message was reproduced in the unit test translated from
Cassandra, select_test.py::testFilteringOnUdtContainingDurations
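
Why the order of the checks matters can be shown with a short Python sketch. The class names are hypothetical, and the tuple variant of the message is assumed by analogy with the UDT message quoted above.

```python
class TupleType:
    """Hypothetical tuple type."""


class UserType(TupleType):
    """A UDT is also a tuple, mirroring the relationship described above."""


def message(kind: str) -> str:
    return f"Slice restrictions are not supported on {kind} containing duration"


def report_reversed(t) -> str:
    # Buggy order: the general check shadows the specific one.
    if isinstance(t, TupleType):
        return message("tuples")
    if isinstance(t, UserType):
        return message("UDTs")  # unreachable for UDTs
    return message("collections")


def report_fixed(t) -> str:
    # Most specific type first, so a UDT is reported as a UDT.
    if isinstance(t, UserType):
        return message("UDTs")
    if isinstance(t, TupleType):
        return message("tuples")
    return message("collections")


udt = UserType()
wrong = report_reversed(udt)
right = report_fixed(udt)
```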
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220428071807.1769157-1-nyh@scylladb.com>
The only user is the tests of downgrade_to_v1(), which uses it through
mutation source. To avoid any new users popping up, we make it a private
method of the latter. In the process the pass-through optimization is
dropped, it is not needed for tests anyway.
We don't have any upgrade_to_v2() left in production code, so no need to
keep testing it. Removing it from this test paves the way for removing
it for good (not in this series).
The only user is row level repair: it is replaced with
downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has
lots of downgrade_to_v1() calls, we will deal with these later all at
once.
Another use is the empty mutation source, this is trivially converted to
use the v2 variant.
The conversion is shallow: the meat of the logic remains v1, fragments
are converted to v2 right before being pushed into the buffer. This
approach is simple, surgical and is still better than a full
upgrade_to_v2().
This just moves the upgrade_to_v2() calls to the other side of said
factory methods, preparing the ground for converting the kl reader impl
to a native v2 one.
The underlying mutation representation is still v1, so the
implementation still has to do conversion. This happens right above the
lsa reader component.
Reads (part of operations) running concurrent to `drop_column_family()`
can create querier cache entries while we wait for them to finish in
`await_pending_ops()`. Move the cache entry eviction to after this, to
ensure such entries are also cleaned up before destroying the table
object.
This moves the `_querier_cache.evict_all_for_table()` from
`database::remove()` to `database::drop_column_family()`. With that the
former doesn't have to return `future<>` anymore. While at it (changing
the signature) also rename `column_family` -> `table`.
Also add a regression unit test.
Said method was already coroutinized, but only halfway, possibly because
of the difficulty in expressing `finally()` with coroutines. We now have
`coroutines::as_future()` which makes this easier, so finish the job.
These bring in wasm.hh (though they really shouldn't) and make
everyone suffer. Forward declare instead and add missing includes
where needed.
Closes #10444
Minor fixlets to make `ninja dev-headers` pass.
Closes #10445
* github.com:scylladb/scylla:
readers/from_mutations_v2.hh: make self-contained
data_dictionary/storage_options.hh: make self-contained
Reduce #include load by standardizing on std::any.
In keys.cc, we just drop the unneeded include.
One instance of boost::any remains in config_file, due to a tie-in with
other boost components.
Closes #10441
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respective
definitions of mutation and frozen_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10430
DynamoDB limits partition-key length to 2048 bytes and sort-key length
to 1024 bytes. Alternator currently has no such limits officially, but
if a user tries a key length of over 64 KB, the result will be an
"internal server error" as Alternator runs into Scylla's low-level key
length limit of 64 KB.
In this patch we add (mostly xfailing) tests confirming all the above
observations. The tests include extensive comments on what they are
testing and why. Some of these tests (specifically, the ones checking
what happens above 64 KB) should pass once Alternator is fixed. Other
tests - requiring that the limits be exactly what they are in DynamoDB -
may either not pass or change in the future, depending on what we decide
the limits should be in Alternator.
Refs #10347
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10438
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
The temporary TOC content is written upfront, so that all partial
components can be deleted using that content if the write fails.
After commit e5fc4b6, a partial sst cannot be deleted because the deletion
procedure incorrectly assumes that every SST being deleted
has a TOC, but partial SSTs only have a TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.
As a consequence, the space taken by partial files cannot be
reclaimed, making it harder for Scylla to recover from ENOSPC,
which it could otherwise do by selecting a set of files for compaction
with a higher chance of succeeding given the free space.
This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().
Fixes #10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
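The failure mode described above (a canonicalizing path lookup that requires every component to exist) can be reproduced in a small sketch. It is an analogy, not the Scylla code: `parent_path` here is a hypothetical Python helper built on `Path.resolve(strict=True)`, and the file names `TOC.txt` and `TOC.txt.tmp` are assumed for illustration.

```python
import tempfile
from pathlib import Path


def parent_path(path):
    # strict=True mimics a canonicalizing lookup: it raises
    # FileNotFoundError if the path does not exist.
    return Path(path).resolve(strict=True).parent


with tempfile.TemporaryDirectory() as d:
    toc = Path(d) / "TOC.txt"          # never written: the sstable is partial
    tmp_toc = Path(d) / "TOC.txt.tmp"  # written upfront, so it exists
    tmp_toc.write_text("Data.db\nIndex.db\n")

    try:
        parent_path(toc)
        toc_lookup_failed = False
    except FileNotFoundError:
        toc_lookup_failed = True

    # Asking for the parent of the temporary TOC works, because that
    # file is guaranteed to exist for a partial sstable.
    partial_dir = parent_path(tmp_toc)
    expected_dir = Path(d).resolve()

print(toc_lookup_failed, partial_dir == expected_dir)
```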
fsync_directory() is broken because it's unconditionally performing
fsync on parent directory, not on the directory that it was called
with. To fix, let's remove wrong parent_path() usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
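As a rough illustration of the intended behavior, a directory fsync helper should sync exactly the directory it is given. The sketch below is a hypothetical Python equivalent, not the Scylla implementation; it assumes a POSIX system where a directory can be opened read-only and passed to `fsync`.

```python
import os
import tempfile


def fsync_directory(dirname):
    """fsync the directory that was passed in, not its parent."""
    # O_DIRECTORY (where available) makes the open fail if dirname is a file,
    # so a caller cannot silently sync the wrong thing.
    flags = os.O_RDONLY | getattr(os, "O_DIRECTORY", 0)
    fd = os.open(dirname, flags)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    sstable_dir = os.path.join(d, "ks", "cf")  # assumed layout, for illustration
    os.makedirs(sstable_dir)
    fsync_directory(sstable_dir)  # syncs "cf" itself, not "ks"
```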
dirname() is confusing because if it's called on a directory, parent
path is retrieved. By renaming it to parent_path(), it's clearer
what the function will do exactly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached into seastar's directory
tree directly, via #include "seastar/include/seastar/...", to
just refer to <seastar/...>.
Closes #10433
In acaf0bb we applied out() only to perftune.py, because issue #10390
was hit with that script.
But the same issue can happen with other commands too, so let's apply it
to all commands that use capture_output.
Related: #10390
Closes #10414
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.
Refs #10418
(to be closed when the fix reaches branch-4.6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10419
lookup_readers might fail after populating some readers,
and those had better be closed before returning the exception.
Fixes #10351
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10425
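The cleanup pattern described in this fix can be sketched generically: if creating the N-th reader fails, the readers already created are closed before the exception propagates. The `Reader` class and the `make_reader` callback below are hypothetical; the real code deals with asynchronous readers rather than plain objects.

```python
class Reader:
    """Hypothetical reader that must be closed explicitly."""
    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


def lookup_readers(names, make_reader):
    readers = []
    try:
        for name in names:
            readers.append(make_reader(name))
    except Exception:
        # Close whatever was populated before the failure, then re-raise
        # so the caller still sees the original error.
        for reader in readers:
            reader.close()
        raise
    return readers
```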
"
The method in question performs node bootstrap in several different
modes (regular, replacing, RBNO), and several subsequent if-else branches
just duplicate each other. This set merges them, making the code easier
to read.
"
* 'br-less-branchy-bootstrap' of https://github.com/xemul/scylla:
storage_service: Remove pointless check in replace-bootstrap
storage_service: Generalize wait for range setup
storage_service: Merge common if-else branches in bootstrap
storage_service: Move tables bootstrap-ON upwards
The doc is being updated to reflect the changes in the commit
d8833de3bb ("Redefine Compaction Backlog to tame
compaction aggressiveness").
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Following a4be927e23
that reverted 2325c566d9
due to #10421, this patch reintroduces an async version
of memtable_list::clear_and_add that calls clear_gently
safely after replacing the _memtables vector with a new one
so that writes and flushes can continue in the foreground
while the old memtables are cleared.
Fixes #10281
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
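The "replace first, clear afterwards" idea can be illustrated with a small asyncio sketch. Everything here is a simplified assumption: `MemtableList` is a hypothetical class whose memtables are plain dicts, and `asyncio.sleep(0)` stands in for the yielding that a gentle clear performs.

```python
import asyncio


class MemtableList:
    """Hypothetical, simplified memtable list: each memtable is a dict."""
    def __init__(self):
        self._memtables = [dict()]

    def active(self):
        return self._memtables[-1]

    async def clear_and_add(self):
        # Swap in a fresh list first, so writers immediately see a new,
        # empty memtable and never touch the ones being cleared.
        old, self._memtables = self._memtables, [dict()]
        # Then clear the old memtables gently, yielding between steps.
        for memtable in old:
            while memtable:
                memtable.popitem()
                await asyncio.sleep(0)


async def demo():
    ml = MemtableList()
    ml.active().update({i: i for i in range(10)})
    task = asyncio.ensure_future(ml.clear_and_add())
    await asyncio.sleep(0)      # let the swap happen
    ml.active()["new"] = 1      # a write arriving while the old data is cleared
    await task
    return ml


result = asyncio.run(demo())
print(result.active())
```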
Get all flush permits to serialize with any
ongoing flushes and prevent further flushes
during table::clear, in particular while calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.
Fixes #10423
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For a follower to forward requests to a leader the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
case the leader will send nothing to the follower and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to the raft's log a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting specially crafted append reject to all
nodes in the cluster on a tick in case a leader is not known. The leader
responds to this message with an empty append request which will cause the
node to learn about the leader. For optimisation purposes the patch
sends the broadcast only in case there is actually an operation that
waits for leader to be known.
Fixes #10379
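The leader-discovery exchange described above can be modeled with a toy, synchronous simulation. This is a sketch of the message flow only: the `Node` class, its method names and the `waiting_ops` counter are hypothetical, and no real raft state (terms, logs, indexes) is modeled.

```python
class Node:
    """Toy model of a raft node, tracking only who it thinks the leader is."""
    def __init__(self, node_id, is_leader=False):
        self.id = node_id
        self.is_leader = is_leader
        self.known_leader = node_id if is_leader else None
        self.waiting_ops = 0  # operations blocked until a leader is known

    def tick(self, cluster):
        # Broadcast only when the leader is unknown and something is
        # actually waiting for it (the optimisation from the text).
        if self.known_leader is None and self.waiting_ops > 0:
            for peer in cluster:
                if peer is not self:
                    peer.on_append_reject(self)

    def on_append_reject(self, sender):
        # Only the leader answers, with an empty append request.
        if self.is_leader:
            sender.on_append_request(leader_id=self.id, entries=[])

    def on_append_request(self, leader_id, entries):
        # Any append request, even an empty one, reveals the leader.
        self.known_leader = leader_id


leader = Node("A", is_leader=True)
follower = Node("B")
other = Node("C")
cluster = [leader, follower, other]

follower.tick(cluster)            # nothing waiting: no broadcast
idle_result = follower.known_leader
follower.waiting_ops = 1
follower.tick(cluster)            # broadcast, leader replies
print(idle_result, follower.known_leader)
```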
Currently an exception is thrown in the apply stage
when the schema is not synced, but by then it is too late:
the returned error doesn't pinpoint which code path
was using an unsynced schema. So move the check
earlier, before _apply_stage is called.
We need to make sure the schema is synced earlier
when the mutation is applied so call on_internal_error
to generate a backtrace in testing and still throw
an error in production.
Typically storage_proxy::mutate_locally implicitly
ensures the schema is synced by making a global_schema_ptr
for it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220424110057.3957597-1-bhalevy@scylladb.com>
In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues #10361, #10399, #10401. The existing test `test_null.py::test_map_subscript_null` reproduced all these bugs sporadically.
In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then **define** both `m[NULL]` and `NULL[2]` to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., `NULL < 2` is also defined as resulting in NULL.
However, this decision differs from Cassandra, where `m[NULL]` is considered an error but `NULL[2]` is allowed. We believe that making `m[NULL]` be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like `m[a]`, where the column `a` can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
Fixes #10361
Fixes #10399
Fixes #10401
Closes #10420
* github.com:scylladb/scylla:
expressions: don't dereference invalid map subscript in filter
expressions: fix invalid dereference in map subscript evaluation
test/cql-pytest: improve tests for map subscripts and nulls
Currently, if a table is dropped during streaming, the streaming would
fail with no_such_column_family error.
Since the table is dropped anyway, it makes more sense to ignore the
streaming result of the dropped table, whether it is successful or
failed.
This allows users to drop tables during node operations, e.g., bootstrap
or decommission a node.
This is especially useful for the cloud users where it is hard to
coordinate between a node operation by admin and user cql change.
This patch also fixes a possible use-after-free issue by not passing
the table reference object around.
Fixes #10395
Closes #10396
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).
Cassandra returns an invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:
* For NULL, we consider m[NULL] to result in NULL - instead of an error.
This behavior is more consistent with other expressions that contain
null - for example NULL[2] and NULL<2 both result in NULL as well.
Moreover, if in the future we allow more complex expressions, such
as m[a] (where a is a column), we can find the subscript to be null
for some rows and non-null for other rows - and throwing an "invalid
query" in the middle of the filtering doesn't make sense.
* For UNSET_VALUE, we do consider this an error like Cassandra, and use
the same error message as Cassandra. However, the current implementation
checks for this error only when the expression is evaluated - not
before. It means that if the scan is empty before the filtering, the
error will not be reported and we'll silently return an empty result
set. We currently consider this ok, but we can also change this in the
future by binding the expression only once (today we do it on every
evaluation) and validating it once after this binding.
Fixes #10361
Fixes #10399
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
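The semantics chosen above can be summarized in a short sketch. It is a model of the described behavior, not the actual expression evaluator: `UNSET_VALUE`, `InvalidRequest` and `map_subscript` are hypothetical Python names, `None` stands in for CQL NULL, and treating a missing key as NULL is an assumption of this sketch.

```python
UNSET_VALUE = object()  # sentinel standing in for an unset bind marker


class InvalidRequest(Exception):
    pass


def map_subscript(column_name, m, key):
    """Evaluate m[key] with the NULL/UNSET rules described in the text."""
    if key is UNSET_VALUE:
        # Same message as quoted from Cassandra above.
        raise InvalidRequest(f"Unsupported unset map key for column {column_name}")
    if m is None or key is None:
        # NULL "wins": both NULL[2] and m[NULL] evaluate to NULL.
        return None
    return m.get(key)  # assumption: a missing key also yields NULL


def filter_matches(column_name, m, key, expected):
    # "WHERE m[?] = expected": a NULL result never matches.
    value = map_subscript(column_name, m, key)
    return value is not None and value == expected


print(filter_matches("m", {1: 2}, 1, 2))     # row kept
print(filter_matches("m", {1: 2}, None, 2))  # m[NULL] is NULL, row dropped
print(filter_matches("m", None, 1, 2))       # NULL[1] is NULL, row dropped
```

Note that, as in the text, the unset-key error only surfaces when the expression is actually evaluated for some row.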
When we have a filter such as "WHERE m[2] = 3" (where m is a map
column), if a row had a null value for m, our expression evaluation
code incorrectly dereferenced an unset optional, and continued
processing the result of this dereference, which resulted in undefined
behavior - sometimes we were lucky enough to get a "marshaling error"
but other times Scylla crashed.
The fix is trivial - just check before dereferencing the optional value
of the map. We return null in that case, which means that we consider
the result of null[2] to be null. I think this is a reasonable approach
and fits our overall approach of making null dominate expressions (e.g.,
the value of "null < 2" is also null).
The test test_filtering.py::test_filtering_null_map_with_subscript,
which used to frequently fail with marshaling errors or crashes, now
passes every time so its "xfail" mark is removed.
Fixes #10417
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_null.py::test_map_subscript_null turned out to reproduce
multiple bugs related to using map subscripts in filtering expressions.
One was issue #10361 (m[null] resulted in a bizarre error) or #10399
(m[null] resulted in a crash), and a different issue was #10401 (m[2]
resulted in a bizarre error or a crash if m itself was null). Moreover,
the same test uncovered different bugs depending how it was run - alone
or with other tests - because it was using a shared table.
In this patch we introduce two separate tests in test_filtering.py
which are designed to reproduce these separate bugs instead of mixing
them into one test. The new tests also cover a few more corners which
the previous test (which focused on nulls) missed - such as UNSET_VALUE.
The two new tests (and the old test_map_subscript_null) pass on
Cassandra, so they still assume that the Cassandra behavior - that m[null]
should be an error - is the correct behavior. We may want to change
the desired behavior (e.g., to decide that m[null] be null, not an
error), and change the tests accordingly later - but for now the
tests follow Cassandra's behavior exactly, and pass on Cassandra
and fail on Scylla (so are marked xfail).
The bugs reproduced by these tests involve randomness or reading
uninitialized memory, so these tests sometimes pass, sometimes fail,
and sometimes even crash (as reported in #10399 and #10401). So to
reproduce these bugs run the tests multiple times. For example:
test/cql-pytest/run --count 100 --runxfail
test_filtering.py::test_filtering_null_map_with_subscript
Refs #10361
Refs #10399
Refs #10401
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
This means there is still conversion going on when reading from cache and
when populating, but when reading from underlying, the stream can now be
passed through as-is without conversions.
Also, any future v2 related changes to the in-memory storage will now be
limited to the cache reader implementation itself.
In the process, the non-forwarding reader, whose only user is the cache,
is also converted to v2.
"
Performance results reported by Botond:
"
build/release/test/perf/perf_simple_query -c1 -m2G --flush --duration=20
BEFORE
median 130421.76 tps ( 71.1 allocs/op, 12.1 tasks/op, 47462 insns/op)
median absolute deviation: 319.64
maximum: 131028.33
minimum: 127502.55
AFTER
median 133297.41 tps ( 64.1 allocs/op, 12.2 tasks/op, 45406 insns/op)
median absolute deviation: 2964.24
maximum: 137581.56
minimum: 123739.4
Getting rid of those upgrade/downgrade was good for allocs and ops.
Curiously there is a 0.1 rise in number of tasks though.
"
* 'row-cache-readers-v2/v1' of https://github.com/denesb/scylla:
row_cache: update reader implementations to v2
range_tombstone_change_generator: flush(): add end_of_range
readers/nonforwardable: convert to v2
read_context: fix indentation
read_context: coroutinize move_to_next_partition()
row_cache: cache_entry::read(): return v2 reader
row_cache: return v2 readers from make_reader*()
readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/
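To put the perf_simple_query medians quoted above into relative terms, the following snippet computes the percentage changes. Only the numbers come from the report; the helper `percent_change` is introduced here for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


tps = percent_change(130421.76, 133297.41)    # throughput, higher is better
allocs = percent_change(71.1, 64.1)           # allocations per op, lower is better
insns = percent_change(47462, 45406)          # instructions per op, lower is better
median_gain = 133297.41 - 130421.76           # absolute tps gain

print(f"tps {tps:+.1f}%, allocs/op {allocs:+.1f}%, insns/op {insns:+.1f}%")
```

The tps gain is smaller than the "AFTER" median absolute deviation (2964.24), so the throughput difference alone may be within noise; the drop in allocs/op and insns/op is the more robust signal.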
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
Allowing to flush all range tombstone changes, including those that have
a position equal to the passed in upper bound, when finishing off a
read-range, e.g. a clustering range from a slice.
It has a single user, the row cache, which for now has to
upgrade/downgrade around the nonforwardable reader, but this will go
away in the next patches when the row cache readers are converted to v2
proper.
The patchset embeds the mutation_fragment upgrading logic from v1 to v2 into the mutation_fragment_queue. This way the mutation fragments coming to the mutation_fragment_queue can be v1, but the underlying queue_reader receives mutation_fragment_v2, eliminating the last usage of queue_reader (v1). The last commit removes queue_reader, queue_reader_handle and associated factory functions.
tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)
Closes #10371
* github.com:scylladb/scylla:
readers: Remove queue_reader v1 and associated code.
repair: Make mutation_fragment_queue internally upgrade fragments to v2
repair: Make mutation_fragment_queue::impl a seastar::shared_ptr
It makes mutation_fragment_queue copyable and makes the pointer to
pending mutation fragments in next commit stable. This allows moving the
mutation_fragment_queue without breaking the underlying
upgrading_consumer.
And adjust callers. The factory functions just sprinkle upgrade_to_v2()
on returned readers for now.
One test in row_cache_test.cc had to be disabled, because the upgrade to
v2 wrapper we now have over cache readers doesn't allow it to directly
control the reader's buffer size and so the test fails. There is a FIXME
left in the test code and the test will be re-enabled once a native v2
reader implementation allows us to get rid of the upgrade wrapper.
It turns out that Cassandra does not allow IN restrictions together with
filtering, except, curiously, when the restriction is on a clustering key.
There is no real reason for this limitation - the error message even says
it is not *yet* supported.
Scylla, on the other hand, does support this case. Of course it's not
enough that we support it - we need to support it correctly... But we don't
have a full regression test that this support is correct - in
filtering_test.cc we test it with clustering and regular columns - but not
partition key columns.
So this patch adds a simple cql-pytest test that this sort of filtering
works in Scylla correctly for partition, clustering and regular columns
(and also confirms that these cases don't work, yet, on Cassandra).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220420075553.1008062-1-nyh@scylladb.com>
The method in question is called in the branch where the replace address
is checked to be present, no need in extra explicit check.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both the if is_replacing()/else branches call the gossiper waiting method as
their first step. This can be done once.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three modes in there -- bootstrap, b.s. with RBNO and b.s. for
replacing. All three are checked two times in a row, but can be done
once.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This call just places a boolean flag on all. It won't hurt if it lasts
while the node is performing pre-bootstrap checks, but it allows making
the whole method less branchy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_with() means we have an unconditional allocation, so we can
justify the coroutine's allocation (replacing it). Meanwhile,
coroutine::parallel_for_each() reduces an allocation if mutate_locally()
blocks.
Closes #10387
gcc 12 checks some things that clang doesn't, resulting in compile errors.
This series fixes some of these issues, but still builds (and tests) with clang.
Unfortunately, we still don't have a clean gcc build due to an outstanding bug [1].
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056
Closes #10386
* github.com:scylladb/scylla:
build: disable warnings that cause false-positive errors with gcc 12
utils: result_loop: remove invalid and incorrect constraint
service: forward_service: avoid using deprecated std::bind1st and std::not1
repair: explicityl ignore tombstone gc update response
treewide: abort() after switch in formatters
db: view: explicitly ignore unused result
compaction: leveled_compaction_strategy: avoid compares between signed and unsigned
compaction_manager: compaction_reenabler: disambiguate compaction_state
api: avoid function specialization in req_param
alternator: ttl: avoid specializing class templates in non-namespace scope
alternator: executor: fix signed/unsigned comparison in is_big()
Currently, rpc handlers are all lambdas inside
storage_proxy::init_messaging_service(). This means any stack trace
refers to storage_proxy::init_messaging_service::lambda#n instead of
a meaningful function name, and it makes init_messaging_service()
very intimidating.
Fix that by moving all such lambdas to regular member functions. The
first two patches remove unnecessary captures to make it easy; the
final patch converts the lambdas to member functions.
Closes #10388
* github.com:scylladb/scylla:
storage_proxy: convert rpc handlers from lambdas to member functions
storage_proxy: don't capture messaging_service in server callbacks
storage_proxy: don't capture migration_manager in server callbacks
Currently we are not able to get any error message from a subprocess when we specify capture_output=True on subprocess.run().
This is because CalledProcessError does not print stdout/stderr when it is raised, and we don't catch the exception; we just let Python produce a traceback.
As a result, we only learn the exit status and the failed command, but
not the stdout/stderr.
This is especially problematic when working on a perftune.py bug, since the
script should have produced a traceback but we were never able to see it.
To resolve this, add a wrapper function "out()" for capturing output, and
print stdout/stderr with the error message inside the function.
Fixes #10390
Closes #10391
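A wrapper along the lines described above might look like the following. This is a minimal sketch, not the actual helper: only the name `out()` and the use of `capture_output` come from the commit message; the parameters, the message format and the choice to re-raise are assumptions.

```python
import subprocess
import sys


def out(cmd, shell=False, timeout=None):
    """Run cmd, return its stripped stdout; on failure, show what it printed."""
    try:
        res = subprocess.run(cmd, capture_output=True, check=True, shell=shell,
                             timeout=timeout, encoding="utf-8")
    except subprocess.CalledProcessError as e:
        # CalledProcessError's default message omits the captured output,
        # so print it explicitly before propagating the error.
        print(f"Command {e.cmd!r} returned non-zero exit status {e.returncode}",
              file=sys.stderr)
        print(f"----- stdout -----\n{e.stdout}", file=sys.stderr)
        print(f"----- stderr -----\n{e.stderr}", file=sys.stderr)
        raise
    return res.stdout.strip()


print(out([sys.executable, "-c", "print('ok')"]))
```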
Checking a concept in a requires-expression requires an additional
requires keyword. Moreover, the constraint is incorrect (at least
all callers pass a T, not a result<T>), so remove it.
Found by gcc 12.
It is typical in switch statements to select on an enum type and
rely on the compiler to complain if an enum value was missed. But
gcc isn't satisfied, since the enum could have a value outside the
declared list. Call abort() in this impossible situation to pacify
it.
Function specializations are not allowed (you're supposed to use
overloads), but clang appears to allow them.
Here, we can't use an overload since the type doesn't appear in the
parameter list. Use a constraint instead.
The C++ standard disallows class template specialization in non-namespace
scopes. Clang apparently allows it as an extension.
Fix by not using a template - there are just two specializations and
no generic implementation. Use regular classes and std::conditional_t
to choose between the two.
Signed/unsigned comparisons are subject to C promotion rules. In is_big()
in this case the comparison is safe, but gcc warns. Use a cast to silence
the warning.
The signed/unsigned mix and int/size_t size differences still look bad; it
would be good to revisit this code, but that is left for another patch.
Series 59d56a3fd7 introduced
an accidental backward-incompatible regression by adding
a column to system_schema.keyspaces and then not even using
it for anything. It's a leftover from the original hackathon
implementation and should never have reached master in the first place.
Fortunately, the series isn't part of any stable release yet.
Fixes #10376
Tests: manual, verifying that the system_schema.keyspaces table
no longer contains the extraneous column.
Closes #10377
Currently, rpc handlers are all lambdas inside
storage_proxy::init_messaging_service(). This means any stack trace
refers to storage_proxy::init_messaging_service::lambda#n instead of
a meaningful function name, and it makes init_messaging_service()
very intimidating.
Fix that by moving all such lambdas to regular member functions.
This is easy now that they don't capture anything except `this`,
which we provide during registration via std::bind_front().
A few #includes and forward declarations had to be added to
storage_proxy.hh. This is unfortunate, but can only be solved
by splitting storage_proxy into a client part and a server part.
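The naming benefit described above has a close analogue in Python, shown here purely as an illustration in another language: a handler registered as a lambda carries an anonymous name in traces, while a bound method (the rough counterpart of registering a member function with `this` bound) carries a meaningful one. The class and handler names below are hypothetical.

```python
class StorageProxy:
    """Hypothetical class used only to compare handler names."""
    def __init__(self):
        self.handlers = {}

    def init_with_lambdas(self):
        # Anonymous handler: traces would only show "<lambda>".
        self.handlers["mutation"] = lambda msg: ("applied", msg)

    def init_with_methods(self):
        # Bound method: `self` is captured implicitly and the name is kept.
        self.handlers["mutation"] = self.handle_mutation

    def handle_mutation(self, msg):
        return ("applied", msg)


proxy = StorageProxy()
proxy.init_with_lambdas()
lambda_name = proxy.handlers["mutation"].__qualname__
proxy.init_with_methods()
method_name = proxy.handlers["mutation"].__qualname__

print(lambda_name)
print(method_name)
```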
We'd like to make the server callbacks member functions, rather
than lambdas, so we need to eliminate their captures. This patch
eliminates 'ms' by referring to the already existing member '_messaging'
instead.
We'd like to make the server callbacks member functions, rather
than lambdas, so we need to eliminate their captures. This patch
eliminates 'mm' by making it a member variable and capturing 'this'
instead. In one case 'mm' was used by a handle_write() intermediate
lambda so we have to make that non-static and capture it too.
uninit_messaging_service() clears the member variable to preserve
the same lifetime 'mm' had before, in case that's important.
* seastar acf7e3523b...5e86362704 (10):
> Merge "Respect taskset-configured cpumask" from Pavel E
Ref #9505.
> rpc_tester: Run CPU hogs on server side too
> std-coroutine: include <coroutine> for LLVM-15
> Revert "Merge "tests: perf: measure coroutines performance" from Benny"
> test: perf_tests: remove [[gnu::always_inline]] attribute from coroutine perf tests
> Merge "tests: perf: measure coroutines performance" from Benny
> Merge "Extend RPC tester" from Pavel E
> rpc: Mark connection trivial getters const noexcept
> seastar-addr2line: Allow use of llvm-addr2line as the command
> file: append_challenged_posix_file: Serialize allocate() to not block concurrent reads or writes
The code would advertise the USES_RAFT feature when the SUPPORTS_RAFT
feature was enabled through a listener registered on the SUPPORTS_RAFT
feature.
This would cause a deadlock:
1. `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`
locks the gossiper (it's called for the first time from sstables
format selector).
2. The function calls `on_change` listeners.
3. One of the listeners is the one for SUPPORTS_RAFT.
4. The listener calls
`gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`.
5. This tries to lock the gossiper.
In turn, depending on timing, this could hang the startup procedure,
which calls `add_local_application_state` multiple times at various
points, trying to take the lock inside gossiper.
This prevents us from testing raft / group 0, new schema change
procedures that use group 0, etc.
For now, simply remove the code that advertises the USES_RAFT feature.
Right now the feature has no other effect on the system than just
becoming enabled. In fact, it's possible that we don't need this second
feature at all (SUPPORTS_RAFT may be enough), but that's
work-in-progress. If needed, it will be easy to bring the enabling code
back (in a fixed form that doesn't cause a deadlock). We don't remove
the feature definitions yet just in case.
Refs: #10355
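The re-entrancy problem in the numbered steps above can be reproduced with a toy model. This is a sketch under stated assumptions: the `Gossiper` class below is hypothetical, it uses a non-reentrant `threading.Lock` in place of the real locking mechanism, and it uses an acquire timeout so that the example reports the deadlock instead of hanging.

```python
import threading


class Gossiper:
    """Toy gossiper: state changes are serialized by a non-reentrant lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = []
        self.state = {}

    def register(self, listener):
        self._listeners.append(listener)

    def add_local_application_state(self, key, value, timeout=0.1):
        # The timeout exists only so the demo fails fast instead of hanging.
        if not self._lock.acquire(timeout=timeout):
            raise TimeoutError(f"could not lock gossiper to set {key}")
        try:
            self.state[key] = value
            # Listeners run while the lock is still held.
            for listener in list(self._listeners):
                listener(key, value)
        finally:
            self._lock.release()


gossiper = Gossiper()


def on_supports_raft(key, value):
    # The problematic listener: it re-enters the gossiper from a callback.
    if key == "SUPPORTED_FEATURES" and "USES_RAFT" not in value:
        gossiper.add_local_application_state(key, value + ["USES_RAFT"])


gossiper.register(on_supports_raft)
try:
    gossiper.add_local_application_state("SUPPORTED_FEATURES", ["SUPPORTS_RAFT"])
    deadlocked = False
except TimeoutError:
    deadlocked = True

print("deadlock detected:", deadlocked)
```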
We start the memory threshold guard (that enables large memory allocation
warnings post-boot) but don't wait for it. I can't imagine it can hurt,
but it does carry a FIXME label.
Closes #10375
This patch series splits up parts of repair pipeline to allow unit testing
various bits of code without having to run full dtest suite. The reason why
repair pipeline has no unit tests is that by definition repair requires multiple
nodes, while unit test environment works only for a single node.
However, it is possible to explicitly define interfaces between various parts of the
pipeline, inject dependencies and test them individually. This patch series is focused
on taking repair_rows_on_wire (frozen mutation representation of changes coming from
another node) and flushing them to an sstable.
The commits are split into the following parts:
- pulling out classes to separate headers so that they can be included (potentially indirectly) from the test,
- pulling out repair_meta::to_repair_rows_list and part of repair_meta::flush_rows_in_working_row_buf so that they can be tested,
- refactoring repair_writer so that the actual writing logic can be injected as dependency,
- creating the unit test.
tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)
Closes #10345
* github.com:scylladb/scylla:
repair: Add unit test for flushing repair_rows_on_wire to disk.
repair: Extract mutation_fragment_queue and repair_writer::impl interfaces.
repair: Make parts of repair_writer interface private.
repair: Rename inputs to flush_rows.
repair: Make repair_meta::flush_rows a free function.
repair: Split flush_rows_in_working_row_buf to two functions and make one static.
repair: Rename inputs to to_repair_rows_list.
repair: Make to_repair_rows_list a free function.
repair: Make repair_meta::to_repair_rows_list a static function
repair: Fix indentation in repair_writer.
repair: Move repair_writer to separate header.
repair: Move repair_row to a separate header.
repair: Move repair_sync_boundary to a separate header.
repair: Move decorated_key_with_hash to separate header.
repair: Move row_repair hashing logic to separate class and file.
* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
main: allow joining raft group0 before waiting for gossiper to settle
service: raft_group0: make `join_group0` re-entrant
service: storage_service: add `join_group0` method
raft_group_registry: update gossiper state only on shard 0
raft: don't update gossiper state if raft is enabled early or not enabled at all
gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
db: system_keyspace: add `bootstrap_needed()` method
db: system_keyspace: mark getter methods for bootstrap state as "const"
"
Optimize consuming from a single partition.
This gives us significant improvement with single, small mutations,
as shown with perf_mutation_readers, compared to the vector-based
flat_mutation_reader_from_mutations_v2.
These are expected to be common on the write path,
and can be optimized for view building.
results from: perf_mutation_readers -c1 --random-seed=840478750
(userspace cpu-frequency governor, 2.2GHz)
test iterations median mad min max
Before:
combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns
After:
combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns
combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns
The grand plan is to follow up
with make_flat_mutation_reader_from_frozen_mutation_v2
so that we can read directly from either a mutation
or frozen_mutation without having to unfreeze it e.g. in
table::push_view_replica_updates.
Test: unit(dev)
Perf: perf_mutation_readers(release)
"
* tag 'flat_mutation_reader_from_mutation-v3' of https://github.com/bhalevy/scylla:
perf: perf_mutation_readers: add one_mutation case
test: mutation_query_test: make make_source static
mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2
mutation readers: add make_flat_mutation_reader_from_mutation_v2
readers: delete slice_mutation.hh
test: flat_mutation_reader_test: mock_consumer: add debug logging
test: flat_mutation_reader_test: mock_consumer: make depth counter signed
"
There's a generic way to start-stop services in scylla, that includes
5 "actions" (some are optional and/or implicit though)
service_config cfg = ...
sharded<service>.start(cfg)
service.invoke_on_all(&service::start)
service.invoke_on_all(&service::shutdown)
service.invoke_on_all(&service::stop)
sharded<service>.stop()
and most of the services out there conform to that scheme. Not snitch
(spoiler: and not tracing), for which there's a couple of helpers that
do all that magic behind the scenes, "configuring" snitch is done with
the help of overloaded constructors. The latter is extra complicated
with the need to register snitch drivers in class-registry for each
constructor overload. Also there's an external shards synchronization
on stop.
This set brings snitch start/stop code to the described standard: the
create/stop helpers are removed, creation accepts the config structure,
per-shard start/stop (snitch has no drain for now) happens in the
simple invoke-on-all manner.
The intended side effect of this change is the ability to add explicit
dependencies to snitch (in the future, not in this set).
tests: unit(dev)
"
* 'br-snitch-config' of https://github.com/xemul/scylla:
snitch: Remove create_snitch/stop_snitch
snitch: Simplify stop (and pause_io)
snitch: Move io_is_stopped to property-file driver
snitch: Remove init_snitch_obj()
snitch: Move instance creation into snitch_ptr constructor
snitch: Make config-based construction of all drivers
snitch: Declare snitch_ptr peering and rework container() method
snitch: Introduce container() method
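The generic start/stop scheme that the snitch series above adopts can be modelled outside Seastar. What follows is a minimal Python sketch, not Scylla code: `Sharded`, `SnitchLike`, and `run_lifecycle` are hypothetical stand-ins for `sharded<service>` and a per-shard service, and the per-shard calls run sequentially instead of on real shards.

```python
class SnitchLike:
    """Hypothetical per-shard service that records its lifecycle calls."""

    def __init__(self, cfg, log, shard):
        self.cfg = cfg
        self._log = log
        self._shard = shard
        self._log.append(("construct", shard))

    def start(self):
        self._log.append(("start", self._shard))

    def shutdown(self):
        self._log.append(("shutdown", self._shard))

    def stop(self):
        self._log.append(("stop", self._shard))


class Sharded:
    """Hypothetical stand-in for sharded<service>: one instance per shard."""

    def __init__(self, factory, shard_count):
        self._factory = factory
        self._shard_count = shard_count
        self._instances = []

    def start(self, *args):
        # Construct one instance per shard from the same arguments.
        self._instances = [self._factory(*args, shard)
                           for shard in range(self._shard_count)]

    def invoke_on_all(self, method):
        # Call the given method on every per-shard instance.
        for instance in self._instances:
            method(instance)

    def stop(self):
        # Destroy the per-shard instances.
        self._instances.clear()


def run_lifecycle(shard_count=2):
    log = []
    cfg = {"name": "example"}  # plays the role of the config structure
    snitch = Sharded(SnitchLike, shard_count)
    snitch.start(cfg, log)
    snitch.invoke_on_all(SnitchLike.start)
    snitch.invoke_on_all(SnitchLike.shutdown)
    snitch.invoke_on_all(SnitchLike.stop)
    snitch.stop()
    return log
```

Because construction takes a single config object, a new dependency would only have to be added in one place, which is the side effect the series aims for.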
A node can join group0 without waiting for gossiper if
it is either a fresh node, or it's an existing node, which
is already part of some group0 (i.e. has `group0_id` persisted
in system tables).
In that case the second `join_group0()` call inside the
`storage_service::join_token_ring` will be a no-op.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Measure performance of the single-mutation reader:
make_flat_mutation_reader_from_mutation_v2.
Comparable to the `one_row` case that consumes the
single mutation using the multi-mutation reader:
make_flat_mutation_reader_from_mutations_v2
perf_mutation_readers shows ~20-30% improvement of
make_flat_mutation_reader_from_mutation_v2 over reading
the same single mutation, just given as a single-item vector
to make_flat_mutation_reader_from_mutations_v2.
test iterations median mad min max
Before:
combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns
After:
combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns
combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extract the common parts of the single mutation reader
and the vector-based variant into mutation_reader_base
and reuse from both readers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
slice_mutations() is currently used only by readers/mutation_readers.cc
so there's no need to expose it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to return stop_iteration::yes once we have crossed
the initial depth threshold. With an unsigned depth counter,
it might wrap around and look > 1.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
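The wraparound described in the mock_consumer commit above can be reproduced with plain integers. This is a minimal Python sketch under stated assumptions: the 32-bit width, the `should_stop` check, and all function names are hypothetical and only illustrate why a signed counter is safer here.

```python
UINT32_MASK = 0xFFFFFFFF  # assumed counter width; the real type is not stated


def decrement_unsigned(depth):
    """Decrement the way a 32-bit unsigned counter does: 0 - 1 wraps around."""
    return (depth - 1) & UINT32_MASK


def decrement_signed(depth):
    """Decrement a signed counter: 0 - 1 is simply -1."""
    return depth - 1


def should_stop(depth):
    """Hypothetical threshold check: stop once the depth is no longer > 1."""
    return not depth > 1
```

Decrementing an unsigned zero yields 4294967295, which still looks greater than 1, so the consumer would keep going; the signed counter yields -1 and stops.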
The test will now, with probability 1/2, enable forwarding of entries by
followers to leaders. This is possible thanks to the new abort_source&
APIs which we use to ensure that no operations are running on servers
before we destroy them.
Some adjustments were required to the server abort procedure in order to
prevent rare hangs (see first patch). We also translate some low-level
exceptions coming from seastar primitives to high-level Raft API
exceptions (second patch).
* kbr/nemesis-enable-fd-v1:
test: raft: randomized_nemesis_test: enable entry forwarding
test: raft: randomized_nemesis_test: increase logging level on some rare operations
raft: server: translate abort_requested_exception to raft::request_aborted
raft: fsm: when stopping, become follower to reject new requests
This pull request adds support for retrying failed forwarder calls
(currently used to parallelize `select count(*) from ...` queries).
Failed-to-forward sub-queries will be executed locally (on a
super-coordinator). This local execution is meant as a fallback for a
forward_requests that could not be sent to its destined coordinator
(e.g. due to the gossiper not reacting fast enough). Local execution was chosen
as the safest one - it does not require sending data to another
coordinator.
Due to problems with miscompilations, some parts of the
`forward_service` were uncoroutinized.
Fixes: #10131
Closes #10329
* github.com:scylladb/scylla:
forward_service: uncoroutinize dispatch method
forward_service: uncoroutinize retrying_dispatcher
forward_service: retry a failed forwarder call
forward_service: copy arguments/captured vars to local variables
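The fallback described in the forward_service pull request above (run a sub-query locally when it cannot be forwarded) can be sketched as follows. This is an illustrative Python model, not the `forward_service` code: `count_with_local_fallback`, the `forward` and `execute_locally` callables, and the use of `ConnectionError` as the failure signal are all assumptions.

```python
def count_with_local_fallback(sub_queries, forward, execute_locally):
    """Sum partial counts, executing locally whatever could not be forwarded.

    sub_queries     -- iterable of (target, query) pairs
    forward         -- callable(target, query) -> partial count; may raise
                       ConnectionError when the target cannot be reached
    execute_locally -- callable(query) -> partial count, run on the
                       super-coordinator itself
    """
    total = 0
    for target, query in sub_queries:
        try:
            total += forward(target, query)
        except ConnectionError:
            # Fallback: no data is sent to another coordinator in this case.
            total += execute_locally(query)
    return total
```

The local path is the conservative choice because it cannot fail for the same reason the forwarding did.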
The restriction "WHERE m[NULL] = 2" should result in an invalid request
error, but currently results in an ugly internal server error.
This test reproduces it, and since the bug is still in the code - is
marked as xfail.
Refs #10361
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412134118.829671-1-nyh@scylladb.com>
This is a reproducer for issue #10359 that a "CONTAINS NULL" and
"CONTAINS KEY NULL" restrictions should not match any set, but currently
do match non-empty or all sets.
The tests currently fail on Scylla, so marked xfail. They also fail on
Cassandra because Cassandra considers such a request an error, which
we consider a mistake (see #4776) - so the tests are marked "cassandra_bug".
Refs #10359.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412130914.823646-1-nyh@scylladb.com>
We already have a test showing that WHERE v=NULL ALLOW FILTERING is
allowed in Scylla (unlike Cassandra), and matches nothing. Here
we add two further tests that confirm that:
1. Not only is v=NULL allowed - v<NULL, v<=NULL, and so on, is also
allowed and matches nothing.
2. The ALLOW FILTERING is required in those requests. Without it,
both Scylla and Cassandra generate the same "ALLOW FILTERING is
required" error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220411214503.770413-1-nyh@scylladb.com>
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.
Fix by updating the error codes.
A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.
Fixes #5610.
Closes #10362
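The downgrade described in the protocol commit above must change the error code together with the exception kind. A minimal Python sketch of that mapping follows; the numeric codes are the ones defined by the CQL native protocol, while `error_code_for` and the lookup table are hypothetical names and not the Scylla implementation.

```python
# Error codes from the CQL native protocol specification.
WRITE_TIMEOUT = 0x1100
READ_TIMEOUT = 0x1200
READ_FAILURE = 0x1300   # added in protocol v4
WRITE_FAILURE = 0x1500  # added in protocol v4

# Codes a pre-v4 client does not understand, and what to send instead.
_PRE_V4_DOWNGRADE = {
    READ_FAILURE: READ_TIMEOUT,
    WRITE_FAILURE: WRITE_TIMEOUT,
}


def error_code_for(protocol_version, code):
    """Return the error code to put on the wire for this protocol version."""
    if protocol_version < 4:
        return _PRE_V4_DOWNGRADE.get(code, code)
    return code
```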
"
Repair code keeps its history in system keyspace and uses the qctx
global thing to update and query it. This set replaces the qctx with
the explicit reference on the system_keyspace object.
tests: unit(dev), dtest.repair_test(dev)
"
* 'br-repair-vs-qctx' of https://github.com/xemul/scylla:
repair, system_keyspace: Query repair_history with a helper
repair: Update loader code to use system_keyspace entry
repair, system_keyspace: Update repair_history with a helper
repair: Keep system keyspace reference
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no user impact.
Fixes #10364.
Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no known user impact.
Fixes #10363.
Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
This miniseries rewrites a few unnecessary throws into forwarding the exception directly. It's partially possible thanks to the new `co_await coroutine::return_exception` mechanism which allows returning from a coroutine early, without explicitly calling co_return (d5843f6e88).
Closes #10360
* github.com:scylladb/scylla:
sstables: remove unnecessary throws
schema_tables: remove unnecessary throws
Querying the table is now done with the help of qctx directly. This
patch replaces it with a querying helper that calls the consumer
function with the entry struct as the argument.
After this change repair code can stop including query_context and
messing with untyped_result_set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Patch the history entry loader to use the recently introduced
history entry. This is just to reduce the churn in the next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current code works directly on the qctx which is not nice. Instead,
make it use the system keyspace reference. To make it work, the patch
adds a helper method and introduces a helper struct for the table
entry. This struct will also be used to query the table (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Repair updates (and queries on start) the system.repair_history table
and thus depends on the system_keyspace object
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Adds a "named_file" wrapper type in commitlog, encapsulating file and disk size, the latter being updated automatically on write/truncate/allocate/delete operations. Use this instead of loose vars in segments, and also in recycle/delete lists.
Having the data propagate with the objects means we can dispose of re-reading sizes from disk, which in turn means we know what "our" view of the file sizes is when we try to delete/recycle them -> we can bookkeep accurately (from our view point) without having to resort to the rather horrible recalculation of disk footprint.
This series also drops non-recycled segment handling, since it is not used anywhere, and just makes things harder.
It also adds a parameter to set flush threshold.
These two first patches could be broken out into separate PR:s if need be.
Closes #10084
* github.com:scylladb/scylla:
commitlog: Fold named_file continuations into caller coroutine frame
commitlog: Use named named_file objects in delete/dispose/recycle lists
commitlog: Use named_file size tracking instead of segment var
commitlog: Use named_file in segment
commitlog: Add "named_file" file wrapping type
commitlog: Make flush threshold a config parameter
commitlog: kill non-recycled segment management
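The idea behind the "named_file" wrapper in the commitlog series above is that the size travels with the object and is updated by every operation that changes it. The following Python sketch models that bookkeeping in memory; `NamedFile`, `write_at`, `truncate`, and `footprint` are hypothetical names, and a `bytearray` stands in for the real file.

```python
class NamedFile:
    """Hypothetical file wrapper that tracks its own size."""

    def __init__(self, name):
        self.name = name
        self._data = bytearray()  # stands in for the on-disk contents
        self.known_size = 0       # "our" view of the file size

    def write_at(self, pos, data):
        end = pos + len(data)
        if end > len(self._data):
            # Grow the backing store so the write fits.
            self._data.extend(b"\0" * (end - len(self._data)))
        self._data[pos:end] = data
        self.known_size = max(self.known_size, end)

    def truncate(self, size):
        if size < len(self._data):
            del self._data[size:]
        else:
            self._data.extend(b"\0" * (size - len(self._data)))
        self.known_size = size


def footprint(files):
    """Total size from tracked values, without re-reading sizes from disk."""
    return sum(f.known_size for f in files)
```

With the size carried by each object, the delete and recycle lists can account for what they release without recalculating the disk footprint.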
With off-strategy, we no longer need LCS explicitly switching to STCS
mode, and even without off-strategy, the dynamic fan-in approach
in compaction manager will cause LCS to automatically switch to
STCS under heavy write load.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220411181322.192830-1-raphaelsc@scylladb.com>
The unit test executes a simplified repair scenario by:
- producing a random stream of mutation_fragments,
- converting them to repair_rows_on_wire,
- converting them to list of repair_rows using the conversion logic
extracted in previous commits from repair_meta,
- flushing the rows to an sstable using the logic extracted in previous
commits from repair_meta,
- comparing the sstable contents with the originally produced mutation
fragments.
The test checks only the flushing part and is not concerned with any
other piece of the repair pipeline.
It allows pulling out the logic of writing internal representation
of repair mutations to disk. This in turn is needed to unit test
this functionality without spinning up clusters, which significantly
improves developer iteration time.
It allows pulling out the logic of converting on-the-wire representation
of repair mutations to an internal representation used later for
flushing repair mutations to disk. This in turn is needed to unit test
the functionality without spinning up clusters, which significantly
improves developer iteration time.
Saves a continuation. That matters very little. But...
Uses a special awaiter type on returns from the "then(...)"-wrapping
named_file methods (which use a then([...update]) to keep internal
size counters up-to-date), making the continuation instead a stored func
in the returned awaiter, executed on successful resume of the caller
co_await.
Changes the delete/close queue, as well as the deletion queue, into one, using
named_file objects + marker. Recycle list now also contains said named
file type.
This removes the need to re-eval file sizes on disk when deleting etc,
which in turn means we can dispose of recalculate_footprint on errors,
thus making things simpler and safer.
This commit makes subscript an invalid argument to possible_lhs_values.
Previously this function simply ignored subscripts
and behaved as if it was called on the subscripted column
without a subscript.
This behaviour is unexpected and potentially
dangerous so it would be better to forbid
passing subscript to possible_lhs_values entirely.
Trying to handle subscript correctly is impossible
without refactoring the whole function.
The first argument is a column for which we would
like to know the possible values.
What are possible values of a subscripted column c where c[0] = 1?
All lists that have 1 on 0th position?
If we wanted to handle this nicely we would have to
change the arguments.
Such refactoring is best left until the time
when this functionality is actually needed,
right now it's hard to predict what interface
will be needed then.
Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
Closes #10228
Filtering remote rpc errors based on exception type did not work because
the remote errors were reported as std::runtime_error and all rpc
exceptions inherit from it. New rpc propagates remote errors using
special type rpc::remote_verb_error now, so we can filter on that
instead.
Fixes #10339
Message-Id: <YlQYV5G6GksDytGp@scylladb.com>
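The filtering problem described in the rpc commit above comes down to a type check that matches too much. The Python sketch below mirrors the situation with a hypothetical exception hierarchy: `RpcError`, `RpcTimeoutError`, and both predicate functions are illustrative, and only the name `RemoteVerbError` corresponds to `rpc::remote_verb_error` from the commit message. Whether the real type shares a base with the other rpc exceptions is an assumption here.

```python
class RpcError(RuntimeError):
    """Hypothetical base of local rpc exceptions; derives from RuntimeError."""


class RpcTimeoutError(RpcError):
    """Hypothetical local transport failure."""


class RemoteVerbError(RpcError):
    """Models rpc::remote_verb_error: an error raised by the remote handler."""


def is_remote_error_old(exc):
    # Old approach: remote errors arrived as plain runtime errors, but every
    # rpc exception is a RuntimeError too, so this matches local failures.
    return isinstance(exc, RuntimeError)


def is_remote_error_new(exc):
    # New approach: filter on the dedicated type.
    return isinstance(exc, RemoteVerbError)
```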
This series speeds up tools/toolchain/prepare in a few ways:
- builds images in parallel
- allows running on any arch as host
- reduces work in building the image
- removes unneeded layers
Closes #10348
* github.com:scylladb/scylla:
tools: toolchain: prepare: squash intermediate container layers
tools: toolchain: update container image first thing
tools: toolchain: prepare: build arch images in parallel
tools: toolchain: prepare: allow running on non-x86
After the previous patches, both create_snitch() and stop_snitch() now look
like the classic sharded service start/stop sequence. Finally both
helpers can be removed and the rest of the user can just call start/stop
on locally obtained sharded references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both first stop/pause snitch driver on io-ing shard, then proceed with
the rest. This sequence is pretty pointless and here's why.
The only non-trivial stop()/pause_io() method out there is in the
property-file snitch driver. In it, both methods check if the current
shard is the io-ing one, if no -- return back the resolved future, if
yes -- go ahead and stop/pause some IO. With this, for all shards but
io-ing one there's no point in starting after io-ing one is stopped,
they all can start (and finish) in parallel.
So what this patch does is just removes the pre-stop/pause kicking of
the io-ing shard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current API to create snitch is not like other services -- there's a
dedicated helper that does sharded<>.start() + invoke_on_all(&start)
calls. These helpers complicate de-globalization of snitch and rework
of services start-stop sequence, things get simpler if snitch uses
the same start-stop API as all the others. The first step towards this
change is moving the non-waiting parts of snitch initialization code
from init_snitch_obj() into snitch_ptr constructor.
A note on this change: after patch #2 the snitch_ptr<->driver linkage
connects local objects with each other, not container() of any. This
is important, because connecting container() would be impossible inside
constructor, as the container pointer is initialized by seastar _after_
the service constructor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently snitch drivers register themselves in class-registry with all
sorts of construction options possible. All those different constructors
are in fact "config options".
When snitch later declares its dependencies (gossiper and system
keyspace), it will require patching all these registrations, which is very
inconvenient.
This patch introduces the snitch_config struct and replaces all the
snitch constructors with the snitch_driver(snitch_config cfg) one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch makes the snitch base class reference local snitch_ptr, not
its sharded<> container and, respectively, makes the base container()
method return _backreference->container() instead.
The motivation of this change is, again, in the next patch, which will
move snitch_ptr<->driver_object linkage into snitch_ptr constructor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some snitch drivers want the peering_sharded_service::container()
functionality, but they can't directly use it, because the driver
class is in fact the pimplification behind the sharded<snitch_ptr>
service. To overcome this there's a _my_distributed pointer on the
driver base class that points back to sharded<snitch_ptr> object.
This patch replaces the direct _my_distributed usage with the
container() method that does it and also asserts that the pointer
in question is initialized (some drivers already do it, some don't).
Other than making the code more peering_sharded_service-like, this
patch allows changing _my_distributed into _backreference that
points to this shard's snitch_ptr, see next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
The database::shutdown() and ::drain() methods are called inside the
invoke_on_all()s synchronizing with each other via the cross-shard
_stop_barrier.
If any shard throws in between, all the others may get stuck waiting for
the barrier to collect all arrivals. To fix it the throwing shard
should wake up others, resolving the wait somehow.
The fix is actually patch #4, the first and the second are the abort()
method for the barrier itself.
Fixes: #10304
tests: unit(dev), manual
"
* 'br-barrier-exception-2' of https://github.com/xemul/scylla:
database: Abort barriers on exception
database: Coroutinize close_tables
test: Add test for cross_shard_barrier::abort()
cross-shard-barrier: Add .abort() method
The database::shutdown() and ::drain() methods are called inside the
container().invoke_on_all() and synchronize with each other via the
cross-shard _stop_barrier. If any shard throws in between, all the others
may get stuck waiting for the barrier to collect all arrivals.
The fix is to abort the barrier on exception thus making all the
shards sitting in shutdown or drain to bail out with exceptions too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
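The abort-on-exception fix described above has a close analogue in Python's standard library, where `threading.Barrier.abort()` breaks the barrier and wakes every waiter with `BrokenBarrierError`. The sketch below uses threads in place of shards; `run_shards` and the outcome strings are hypothetical, and this is an analogy rather than the `cross_shard_barrier` code.

```python
import threading


def run_shards(n_shards, failing_shard=None):
    """Run n_shards workers that meet at a barrier; one may fail first."""
    barrier = threading.Barrier(n_shards)
    outcomes = [None] * n_shards

    def shard(i):
        try:
            if i == failing_shard:
                raise RuntimeError("shard failed before arriving")
            # The timeout is only a safety net for this sketch.
            barrier.wait(timeout=5)
            outcomes[i] = "ok"
        except threading.BrokenBarrierError:
            # Woken up by abort() instead of waiting for a missing arrival.
            outcomes[i] = "aborted"
        except RuntimeError:
            # The throwing shard wakes up the others.
            barrier.abort()
            outcomes[i] = "failed"

    threads = [threading.Thread(target=shard, args=(i,))
               for i in range(n_shards)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

Without the `abort()` call the three healthy workers would sit in `wait()` until the safety timeout, which is the hang the patch removes.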
* seastar 2a2a1305...05cdfc2d (5):
> Revert "core: reactor: fix a typo in `smp_pollfn::poll()`"
> core: reactor: fix a typo in `smp_pollfn::poll()`
> coroutine/exception: make it work with co_await
> perftune.py: arfs: allow toggling on/off and allow auto-detection
> coroutine: introduce as_future
Recently I added a test that verified that blobAsInt() accepts a zero-
byte blob and returns an "empty" integer. I was asked by one of the
reviewers - what happens if we try to pass a *three* byte blob to
blobAsInt()? Here is a new test that demonstrates that the answer is:
Besides the 0-byte blob, blobAsInt() only allows a 4-byte blob. Trying
3 or 5 bytes will result in an invalid query error being returned.
The test passes on both Cassandra and Scylla, confirming their behavior
is the same. The test checks all fixed-sized integer types - int (4
bytes), bigint (8 bytes), smallint (2 bytes) and tinyint (1 byte).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220411093803.651881-1-nyh@scylladb.com>
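The length rule that the blobAsInt() test above verifies can be written down as a small validation function. This is a Python model of the observed behavior, not the server code: `blob_as_integer`, the `EMPTY` sentinel, and the use of `ValueError` for the invalid query error are assumptions, as is applying the zero-byte rule to all four types.

```python
EMPTY = object()  # sentinel for the "empty" (zero-byte) integer value

# Sizes in bytes of the fixed-size CQL integer types named in the test.
INT_WIDTHS = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}


def blob_as_integer(blob, cql_type="int"):
    """Convert a blob to an integer, accepting only 0 or exactly width bytes."""
    width = INT_WIDTHS[cql_type]
    if len(blob) == 0:
        return EMPTY
    if len(blob) != width:
        raise ValueError(
            f"{cql_type} expects {width} bytes, got {len(blob)}")
    # CQL integers are big-endian two's complement.
    return int.from_bytes(blob, "big", signed=True)
```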
This patch implements the previously-unimplemented Select option of the
Query and Scan operators.
The most interesting use case of this option is Select=COUNT which means
we should only count the items, without returning their actual content.
But there are actually four different Select settings: COUNT,
ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES.
Five previously-failing tests now pass, and their xfail mark is removed:
* test_query.py::test_query_select
* test_scan.py::test_scan_select
* test_query_filter.py::test_query_filter_and_select_count
* test_filter_expression.py::test_filter_expression_and_select_count
* test_gsi.py::test_gsi_query_select_1
These tests cover many different cases of successes and errors, including
combination of Select and other options. E.g., combining Select=COUNT
with filtering requires us to get the parts of the items needed for the
filtering function - even if we don't need to return them to the user
at the end.
Because we do not yet support GSI/LSI projection (issue #5036), the
support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need
to be in the future, but we can only finish that after #5036 is done.
Fixes #5058.
The most intrusive part of this patch is a change from attrs_to_get -
a map of top-level attributes that a read needs to fetch - to an
optional<attrs_to_get>. This change is needed because we also need
to support the case that we want to read no attributes (Select=COUNT),
and attrs_to_get.empty() used to mean that we want to read *all*
attributes, not no attributes. After this patch, an unset
optional<attrs_to_get> means read *all* attributes, a set but empty
attrs_to_get means read *no* attributes, and a set and non-empty
attrs_to_get means read those specific attributes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-2-nyh@scylladb.com>
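The three states of `optional<attrs_to_get>` described in the Select commit above map naturally onto an optional set. The Python sketch below is illustrative only: `project` is a hypothetical helper, and `None` plays the role of the unset optional.

```python
from typing import Any, Dict, Optional, Set


def project(item: Dict[str, Any],
            attrs_to_get: Optional[Set[str]]) -> Dict[str, Any]:
    """Return the part of item selected by attrs_to_get.

    None          -> read all attributes
    empty set     -> read no attributes (the Select=COUNT case)
    non-empty set -> read only those specific attributes
    """
    if attrs_to_get is None:
        return dict(item)
    return {name: value for name, value in item.items()
            if name in attrs_to_get}
```

Before the change an empty container had to mean "everything", which left no way to ask for nothing.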
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression parameters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.
Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax
However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.
So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).
Fixes #10332
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
In the existing test we noticed that list_append(if_not_exists(...))
is allowed, but list_append(list_append(...)) is not. I wasn't sure
whether if_not_exists(if_not_exists(..)) will be allowed - and this
test verifies that it is - it works on both Scylla and DynamoDB, and
gives the same results on both.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220407122729.155648-1-nyh@scylladb.com>
We had in test_null.py a mixture of tests for null values and the
"null" CQL keyword - and tests for empty values. Null and empty
values are *not* the same thing, and there is no reason to keep the
tests for the two things in the same file and further confuse these
two distinct concepts.
This patch just moves code from test_null.py into a new test_empty.py -
there are no functional changes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220407090348.137583-2-nyh@scylladb.com>
In https://github.com/scylladb/scylla-rust-driver/issues/278 we noted
that beyond the concept of a null integer value (which has size -1),
there is also an empty integer value (size 0). This patch adds a test
that it works as expected. And we see that it does - Scylla stores such
a value fine, and the Python driver retrieves it the same as a null
(arguably, this is fine - the important point is to see that we don't
get a crash or an error).
The test passes - I just added it as a regression test for the future.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220407090348.137583-1-nyh@scylladb.com>
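The distinction drawn above between a null value (size -1) and an empty value (size 0) is visible in how the CQL native protocol frames a value: a 4-byte signed length followed by that many bytes, with a negative length meaning null. The Python sketch below encodes and decodes that framing; the function names are hypothetical.

```python
import struct


def encode_value(value):
    """Frame a value as a CQL [bytes]: int32 length, then the payload.

    None is encoded as length -1 with no payload (null);
    b"" is encoded as length 0 with no payload (empty).
    """
    if value is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(value)) + value


def decode_value(buf):
    """Inverse of encode_value; returns None for null, bytes otherwise."""
    (length,) = struct.unpack_from(">i", buf, 0)
    if length < 0:
        return None
    return bytes(buf[4:4 + length])
```

A driver that maps both results to the same language-level value loses the distinction, which matches what the commit reports for the Python driver.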
Otherwise, rpm dependency resolution starts by installing an older
version of gcc (to satisfy an older preinstalled libgcc dependency),
then updates it. After the change, we install the updated gcc in
the first place.
`prepare` builds a multiarch image using qemu emulation. It turns
out that aarch64 emulation is slowest (due to emulating pointer
authentication) so it makes sense to run it on an aarch64 host. To do
that, we need only to adjust the check for qemu installation.
Unfortunately, docker arch names and Linux arch names are different,
so we have to add an ungainly translation, but otherwise it is a
simple loop.
This series is part of the shared storage project.
The STORAGE option is designed to hold a map of options
used for customizing storage for given keyspace.
The option is kept in a system_schema.scylla_keyspaces table.
This option is guarded with a schema feature, because it's kept in a new schema table: `system_schema.scylla_keyspaces`.
Example of the contents of the new table:
```cql
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;
keyspace_name | storage_options | storage_type
---------------+------------------------------------------------+--------------
ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} | S3
```
Native storage options are not kept in the table, as this format doesn't hold any extra options and it would therefore just be a waste of storage.
Closes #10144
* github.com:scylladb/scylla:
test: regenerate schema_change_test for storage options case
test: improve output of schema_change_test regeneration
docs: add a paragraph on keyspace storage options
test: add test cases for keyspace storage options
database,cql3: add STORAGE option to keyspaces
db: add keyspace-storage-options experimental feature
db,schema_tables: add scylla_keyspaces table
db,gms: add SCYLLA_KEYSPACE schema feature
db,gms: add KEYSPACE_STORAGE_OPTIONS feature
Simplify view_update_builder::build_some by turning it into a coroutine,
and make view_updates::move_to async (also using a coroutine) so it may yield in-between building the updates, since freezing each mutation can be cpu intensive and preparing many updates synchronously may cause reactor stalls.
Test: unit(dev)
DTest: materialized_views_test.py(dev)
Closes #10344
* github.com:scylladb/scylla:
db: view_updates: coroutinize move_to
db: view_update_builder: build_some: maybe yield between updates
db: view_update_builder: build_some: fixup indentation
db: view_update_builder: coroutinize build_some
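The point of the view_updates change above is to give the scheduler a chance to run other work between CPU-heavy steps. The Python sketch below shows the same pattern with asyncio; `build_updates` and the `freeze` callable are hypothetical, and `asyncio.sleep(0)` stands in for a Seastar yield.

```python
import asyncio


async def build_updates(mutations, freeze):
    """Freeze each mutation, yielding to the event loop between updates."""
    frozen = []
    for mutation in mutations:
        frozen.append(freeze(mutation))  # potentially CPU-intensive step
        await asyncio.sleep(0)           # let other tasks run before the next one
    return frozen
```

Without the yield the whole batch would run in one uninterrupted stretch, which is the kind of stall the commit aims to avoid.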
While reviewing "utils/chunked_managed_vector: Fix corruption in case there is more
than one chunk", I was worried that there could be a correctness issue
when pop_back() pops off the first element of the last chunk, but turns
out I made an off-by-one error in my theory. Anyway, I wrote a unit test
to verify my assumption and I found it worth submitting upstream.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408133555.12397-2-raphaelsc@scylladb.com>
Keyspace storage options series adds a new schema table:
system_schema.scylla_keyspaces. The regenerated cases ensure
that this new table is taken into account when the schema feature
is available.
Schema change test operates on pre-generated sstables, and sometimes
this set of sstables needs to be regenerated. In order to make the
regeneration process more ergonomic, the output is now directly
copyable as valid C++ representation of UUIDs.
The test cases check if it's possible to set and/or alter
storage options for keyspaces with CQL, and whether the changes
are reflected in the schema tables.
The STORAGE option is designed to hold a map of options
used for customizing storage for given keyspace.
The option is kept in a system_schema.scylla_keyspaces table.
The option is only available if the whole cluster is aware
of it - guarded by a cluster feature.
Example of the table contents:
```
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;
keyspace_name | storage_options | storage_type
---------------+------------------------------------------------+--------------
ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} | S3
```
The table holds scylla-specific information on keyspaces.
The first columns include storage_type and storage_options,
which will be used later to store storage information.
The feature represents the ability to store storage options
in keyspace metadata: represented as a map of options,
e.g. storage type, bucket, authentication details, etc.
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.
Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.
Currently, the container is only used in the sstable partition index
cache.
Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.
Introduced in 78e5b9fd85 (4.6.0)
Fixes #10290
Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
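The corruption described above comes from always touching the last chunk even when reserve() has allocated chunks beyond the one that holds the end of the data. The Python sketch below shows the corrected indexing on a toy container; `ChunkedVector` and its methods are hypothetical and much simpler than `chunked_managed_vector`.

```python
class ChunkedVector:
    """Toy chunked vector: elements live in fixed-capacity chunks."""

    def __init__(self, chunk_capacity):
        self._cap = chunk_capacity
        self._chunks = []
        self._size = 0

    def reserve(self, n):
        # May allocate several chunks ahead of the current size.
        while len(self._chunks) * self._cap < n:
            self._chunks.append([])

    def push_back(self, value):
        self.reserve(self._size + 1)
        # Pick the chunk from the size, not simply the last allocated chunk.
        self._chunks[self._size // self._cap].append(value)
        self._size += 1

    def pop_back(self):
        if self._size == 0:
            raise IndexError("pop from empty vector")
        self._size -= 1
        # The last element lives in the chunk that the new size points into.
        return self._chunks[self._size // self._cap].pop()

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._chunks[i // self._cap][i % self._cap]

    def __len__(self):
        return self._size
```

Using `self._chunks[-1]` in push_back() after a multi-chunk reserve() would place the first element in the third chunk, so index 0 would no longer find it.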
Just delegates work to `service::raft_group0::join_group0()`
so that it can be used in `main` to activate raft group0
early in some cases (before waiting for gossiper to settle).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `gossiper::add_local_application_state` is not
safe to call concurrently from multiple shards (which
will cause a deadlock inside the method), call this
only on shard 0 in `_raft_support_listener`.
This fixes sporadic hangs during startup of a fresh node in an
empty cluster.
Tests: unit(dev), manual
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There is a listener in the `raft_group_registry`,
which makes the gossiper re-publish supported
features app state to the cluster.
We don't need to do this in case `USES_RAFT_CLUSTER_MANAGEMENT`
feature is enabled before the usual time, i.e. before the
gossiper settles. So, short-circuit the listener logic in
that case and do nothing.
Also, don't do anything if raft group registry is not enabled
at all, this is just a generic safeguard.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The main reason for adding rust dependency to scylla is the
wasmtime library, which is written in rust. Although there
exist c++ bindings, they don't expose all of its features,
so we want to do that ourselves using rust's cxx.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
[avi: update toolchain]
[avi: remove example, saving for a follow-on]
Colordiff is problematic when writing the diff into a file for later
examination. Use regular diff instead. One can still get syntax
highlighting by writing the output into a `.diff` file (which most editors
will recognize).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220407080944.324108-1-bdenes@scylladb.com>
There's a public call on replica::table to get back the compaction
manager reference. It's not needed, actually. The users of the call are
distributed loader which already has database at hand, and a test that
creates its own instance of compaction manager for its testing tables
and thus also has it available.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220406171351.3050-1-xemul@scylladb.com>
The test runs a loop of arrivals, each of which can randomly
throw before arriving. As a result, the test expects all shards
to resolve into exception in the same phase.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method makes all the .arrive_and_wait()s in the current phase
to resolve with barrier_aborted_exception() exceptional future.
The barrier turns into a broken state and is not supposed to serve
any subsequent arrivals in any reasonable way.
The .abort() method is re-entrant in two senses. The first is that
more than one shard can abort a barrier, which is pretty natural.
The second is that the exception-safety fuses like that imply that
if the arrive_and_wait() resolves into exception the caller will try
to abort() the barrier as well, even though the phase would be over.
This case is also "supported".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Gossiper calls snitch->gossiper_starting() when being enabled. This
generates a dependency loop -- snitch needs gossiper to gossip its
states and get DC/RACK, gossiper needs snitch to do this kick.
This set removes this notification. The new approach is to kick the
snitch to gossip its states in the same places where gossiper is
enabled() so that only the snitch->gossiper dependency remains.
As a side effect the set ditches a bunch of references to global
snitch instance.
tests: unit(dev)
"
* 'br-snitch-gossiper-starting' of https://github.com/xemul/scylla:
snitch: Remove gossiper_starting()
snitch: Remove gossip_snitch_info()
property-file snitch: Re-gossip states with the help of .get_app_states()
property-file snitch: Reload state in .start()
ec2 multi-region snitch: Register helper in .start()
snitch, storage service: Gossip snitch info once
snitch: Introduce get_app_states() method
property-file snitch: Use _my_distributed to re-shard
storage service: Shuffle snitch name gossiping
Makes final function and initial condition to be optional while
creating UDA. No final function means UDA returns final state
and default initial condition is `null`.
Both items were optional in cql's grammar but they were treated as required in code.
Additionally I've added check if state function returns state.
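The semantics above can be sketched in Python as follows. This is an
illustrative model only, not the CQL implementation; run_aggregate and its
parameter names are hypothetical:

```python
def run_aggregate(rows, sfunc, final_func=None, initcond=None):
    """Model of a UDA: fold rows with sfunc, then optionally finalize."""
    state = initcond              # default initial condition is null (None)
    for row in rows:
        state = sfunc(state, row)
    if final_func is None:
        return state              # no final function: return the final state
    return final_func(state)
```

Note that with a null initial condition the state function must itself cope
with a null state on its first call.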
Fixes #10324
Closes #10331
* github.com:scylladb/scylla:
CQL3: check sfunc return type in UDA
cql-pytest: UDA no final_func/initcond tests
cql3: allow no final_func and no initcond in UDA
Failed-to-forward sub-queries will be executed locally (on a
super-coordinator). This local execution is meant as a fallback for
forward_requests that could not be sent to its destined coordinator
(e.g. due to the gossiper not reacting fast enough). Local execution was chosen
as the safest one - it does not require sending data to another
coordinator.
Adds sub-template for time_parallel with templated result type + optional per-iteration post-process func. Idea is that Res may be a subtype of perf_result, with additional stats, initiated on init, and post-process function can fix up and apply stats -> we can add stats to result.
Then uses this mighty construct to add some IO stats to CL perf.
Closes#10334
* github.com:scylladb/scylla:
perf_commitlog: Add bytes + bytes written stats
perf: Add aio_writes mixin for perf_results
test/perf/perf.hh: Make templated version of test routine to allow extended stats
We had a Python typo ("false" instead of "False") which prevented
tests with the fails_without_raft marker from running on Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405170337.36321-1-nyh@scylladb.com>
Makes final function and initial condition to be optional while
creating UDA. No final function means UDA returns final state
and default initial condition is `null`.
Fixes: #10324
The test will now, with probability 1/2, enable forwarding of entries by
followers to leaders. This is possible thanks to the new abort_source&
APIs which we use to ensure that no operations are running on servers
before we destroy them.
When testing Scylla, cql-pytest does *not* need an external nodetool
command - it uses the REST API instead because it is much faster and
there is no need to install anything. However, if cql-pytest is run
against Cassandra, the tests do want to use the "nodetool" utility and
want to know what it is. The tests use either the NODETOOL environment
variable, or if that doesn't exist, look for "nodetool" in the path.
If nodetool wasn't found in that way, before this patch, we got an ugly
error message with long irrelevant Python backtraces. It wasn't easy
to understand that what actually happened was that the user forgot
to set the NODETOOL environment variable.
This patch cleans up this error handling. Now, if nodetool cannot be
found, every test that tries to run nodetool will report just a one-
line error message, clearly explaining what went wrong and how to
fix it:
Error: Can't find nodetool. Please set the NODETOOL
environment variable to the path of the nodetool utility.
To reiterate, when testing Scylla, nodetool is *not* needed even after
this patch. These errors will not happen even if you don't have the
nodetool utility. You only need nodetool if you plan to test Cassandra.
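The lookup order and the one-line error described above could look roughly
like this. It is a sketch, not the actual cql-pytest code; find_nodetool and
NodetoolNotFound are hypothetical names:

```python
import os
import shutil


class NodetoolNotFound(Exception):
    """Raised with a one-line, user-facing message."""


def find_nodetool(environ=None):
    # Allow passing a custom environment mapping, mainly for testing.
    environ = os.environ if environ is None else environ
    # 1. The NODETOOL environment variable wins if it is set.
    path = environ.get("NODETOOL")
    # 2. Otherwise look for "nodetool" in the executable search path.
    if not path:
        path = shutil.which("nodetool", path=environ.get("PATH"))
    if not path:
        raise NodetoolNotFound(
            "Error: Can't find nodetool. Please set the NODETOOL "
            "environment variable to the path of the nodetool utility.")
    return path
```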
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405171835.43992-1-nyh@scylladb.com>
Increase the logging level on the few operations which happen at the end
of the test but make debugging a bit easier if the test hangs for some
reason.
The `wait_for_leader` function would throw a low-level
`abort_requested_aborted` exception from seastar::shared_promise.
Translate it to the high-level raft::request_aborted so we can reduce
the number of different exception types which cross the Raft API
boundary.
Also, add comments on Raft API functions about the exception thrown when
requests are aborted.
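The translation pattern can be illustrated in Python. The exception classes
below are stand-ins for the C++ types named above, and the `wait` callable
is a hypothetical placeholder for the low-level wait:

```python
class AbortRequested(Exception):
    """Stand-in for the low-level shared_promise abort exception."""


class RequestAborted(Exception):
    """Stand-in for the high-level raft::request_aborted."""


def wait_for_leader(wait):
    # Catch the low-level type at the API boundary and re-raise the
    # high-level one, keeping the original as the cause for debugging.
    try:
        return wait()
    except AbortRequested as e:
        raise RequestAborted("wait_for_leader was aborted") from e
```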
After enabling add_entry forwarding in randomized_nemesis_test, the test
would sometimes hang on _rpc->abort() call due to add_entry messages
from followers which waited on log_limiter_semaphore on the leader
preventing _rpc from finishing the abort; the log_limiter_semaphore would
not get unblocked because the part of the server was already stopped.
Prevent log_limiter_semaphore from being waited on when stopping the
server by becoming a follower in fsm::stop.
Adds sub-template for time_parallel with templated result type + optional
per-iteration post-process func. Idea is that Res may be a subtype of
perf_result, with additional stats, initiated on init, and post-process
function can fix up and apply stats -> we can add stats to result.
"
Examining sstables of system tables is quite a common task. Having to
dump the schemas of such tables into a schema.cql is annoying knowing
that these schemas are readily available in scylla, as they are
hardcoded. This mini-series adds a method to make use of this fact, by
adding a new option: `--system-schema`, which takes the name of a system
table and looks up its schema.
Tests: unit(dev)
"
* 'scylla-sstable-system-schema/v1' of https://github.com/denesb/scylla:
tools/scylla-sstable: add alternative schema load method for system tables
tools/schema_loader: add load_system_schema()
db/system_distributed_keyspace: add all tables methods
tools/scylla-sstable: reorganize main help text
Allowing to consume the frozen_mutation directly
to a stream rather than unfreezing it first
and then consuming the unfrozen mutation.
Streaming directly from the frozen_mutation
saves both cpu and memory, and will make it
easier to make async as a follow-up, to allow
yielding, e.g. between rows.
This is used today only in to_data_query_result
which is invoked on the read-repair path.
Refs #10038
Fixes #10021
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>
We had an old TODO in the Alternator "Scan" operation code which
suggested that we may need to do something to limit the size of pages
when a row limit ("Limit") isn't given.
But we do already have a built-in limit on page sizes (1 MB),
so this TODO isn't needed and can be removed.
But I also wanted to make sure we have a test that this limit works:
We already had a test that this 1 MB limit works for a single-partition
Query (test_query.py::test_query_reverse_longish - tested both forward
and reversed queries). In this patch I add a similar test for a whole-
table Scan. It turns out that although page size is limited in this case
as well, it's not exactly 1 MB... For small tables it can even reach 3 MB.
I consider this "good enough" and that we can drop the TODO, but also
opened issue #10327 to document this surprising (for me) finding.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220404145240.354198-1-nyh@scylladb.com>
This patch adds two xfailing tests for issue #7933. That issue is about
what Scan or Query paging does when encountering a very long string of
consecutive tombstones (partition or row tombstones). Ideally, in that
case the scan could stop on one of these tombstones after already
processing too many. But as these two tests demonstrate, the scan can't
stop in the middle of a long string of tombstones - and as a result
retrieving a single page can take an unbounded amount of time, which is
wrong.
Currently the tests are marked `@veryslow` (they each take more than a
minute) because they each create a huge number of tombstones to
demonstrate a huge amount of work for a single page. When we fix
issue #7933 and have a much smaller limit on the number of tombstones
processed in a single page, we can hopefully make these tests much
shorter and remove the `@veryslow` tag. The `@veryslow` tag means
that although these tests can be used manually (with `--runveryslow`)
they will not yet be run as part of the usual regression tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220403070706.250147-1-nyh@scylladb.com>
Bucket awareness in cleanup was introduced in a69d98c3d0.
STCS and TWCS already support it, and now LCS will receive it.
The goal of bucket awareness is to reduce writeamp in cleanup,
therefore reducing operation time. Additionally, garbage collection
becomes more efficient as shadowed data can now be potentially
compacted with the data that shadows it, assuming they're on
the same level.
The implementation for LCS is simple. Will reuse the procedure
for STCS for returning jobs in level 0. And one job will be
returned for each non-empty level > 0. What allows us to do it
is our incremental selection approach used in compaction,
that sets a limit on memory usage and disk space requirement.
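A rough Python model of the job selection described above. The function
name, the list-of-levels representation and the size_tiered_jobs callable
are all assumptions made for illustration; the real code lives in the C++
compaction strategy:

```python
def lcs_cleanup_jobs(levels, size_tiered_jobs):
    """levels[i] is the list of sstables in level i.

    size_tiered_jobs is a callable standing in for the STCS procedure;
    it splits level 0 into cleanup jobs.
    """
    jobs = []
    if levels:
        # Level 0 reuses the size-tiered procedure.
        jobs.extend(size_tiered_jobs(levels[0]))
    # One job for each non-empty level > 0.
    jobs.extend(list(level) for level in levels[1:] if level)
    return jobs
```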
Fixes #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220331173417.211257-1-raphaelsc@scylladb.com>
When the highest tombstone is open ended, we must
emit a closing range_tombstone_change at
position_in_partition::after_all_clustered_rows().
Since all consumers need to do it, implement the logic
in the range_tombstone_change_generator itself.
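A toy Python model of this behaviour, under heavy simplifications:
positions are plain numbers, float("inf") stands in for
position_in_partition::after_all_clustered_rows(), and input tombstones are
assumed not to overlap. The class below is illustrative, not the real
generator:

```python
AFTER_ALL_CLUSTERED_ROWS = float("inf")


class RangeTombstoneChangeGenerator:
    def __init__(self):
        self._pending = []  # (position, tombstone or None), sorted by position

    def consume(self, start, end, tombstone):
        self._pending.append((start, tombstone))
        self._pending.append((end, None))      # None closes the range
        self._pending.sort(key=lambda change: change[0])

    def flush(self, upper_bound):
        """Emit changes strictly before upper_bound.

        When upper_bound is after all clustered rows, also emit the
        closing change of an open-ended tombstone at that position.
        """
        emitted = []
        while self._pending:
            pos, _ = self._pending[0]
            closes_open_ended = (upper_bound == AFTER_ALL_CLUSTERED_ROWS
                                 and pos == AFTER_ALL_CLUSTERED_ROWS)
            if pos < upper_bound or closes_open_ended:
                emitted.append(self._pending.pop(0))
            else:
                break
        return emitted
```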
It turned out that mutation::consume doesn't do that,
hence this series, and 5a09e5234ef4e1ee673bc7fca481defbbb2c0384 in particular,
fix the issue.
Change 028b2a8cdfdc12721b2be23d175cbc756d2507de exposes the issue
by generating a richer set of random range_tombstone that include open-ended
range tombstones.
Fixes #10316
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10317
* github.com:scylladb/scylla:
test: random_mutation_generator: make more interesting range tombstones
reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change
range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows
range_tombstone_change_generator: flush: use tri_compare rather than less
range_tombstone_change_generator: flush: return early if empty
Copying captured variables into local variables (that live in a
coroutine's frame) is a mitigation of suspected lifetime issues.
Arguments of forward_service::dispatch are also copied (to prevent
potential undefined behavior or miscompilation triggered by
referencing the arguments in a capture list of a lambda that produces a
coroutine).
Add missing include of "<list>" which caused compile errors on GCC:
In file included from generic_server.cc:9:
generic_server.hh:91:10: error: ‘list’ in namespace ‘std’ does not name a template type
91 | std::list<gentle_iterator> _gentle_iterators;
| ^~~~
generic_server.hh:19:1: note: ‘std::list’ is defined in header ‘<list>’; did you forget to ‘#include <list>’?
18 | #include <seastar/net/tls.hh>
+++ |+#include <list>
19 |
Note that there are some GCC compilation problems still left apart from
this one.
Closes#10328
Tests for warning and error lines in logfile when user executes
big batch (above preconfigured thresholds in scylla.yaml).
Signed-off-by: Lukasz Sojka <lukasz.sojka@scylladb.com>
Closes#10232
Previous versions of the Docker image ran scylla as root, but cb19048
accidentally changed it to the scylla user.
To keep compatibility we need to revert this to root.
Fixes #10261
Closes #10325
A node that runs DDL query while its cluster does not have a quorum
cannot be shut down since the query is not abortable. The series makes it
abortable and also fixes the order in which components are shut down to
avoid the deadlock.
* gleb/raft_shutdown_v4 of git@github.com:scylladb/scylla-dev.git:
migration_manager: drain migration manager before stopping protocol servers on shutdown
migration_manager: pass abort source to raft primitives
storage_proxy: relax some read error reporting
When flushing range tombstones up to
position_in_partition::after_all_clustered_rows(),
the range_tombstone_change_generator now emits
the closing range_tombstone_change, so there's
no need for the upgrading_consumer to do so too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the highest tombstone is open ended, we must
emit a closing range_tombstone_change at
position_in_partition::after_all_clustered_rows().
Since all consumers need to do it, implement the logic
in the range_tombstone_change_generator itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
less is already using tri_compare internally,
and we'll use tri_compare for equality in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We changed supervisor service name at cb19048, but this breaks
compatibility with scylla-operator.
To fix the issue we need to revert the service name to previous one.
Fixes #10269
Closes #10323
Related issue scylladb/sphinx-scylladb-theme#395
ScyllaDB Sphinx Theme 1.2 is now released.
We’ve added automatic checks for broken links and introduced numerous UI updates.
You can read more about all notable changes here.
Closes#10313
When a Raft API call such as `add_entry`, `set_configuration` or
`modify_config` takes too long, we need to time-out. There was no way to
abort these calls previously so we would do that by discarding the futures.
Recently the APIs were extended with `abort_source` parameters. Use this.
Also improve debuggability if the functions throw an exception type that
we don't expect. Previously if they did, a cryptic assert would fail
somewhere deep in the generator code, making the problem hard to debug.
Also collect some statistics in the test about the number of successful
and failed ops. I used it to manually check whether there was a
difference in how often operations fail with using the old timeout
method and the new timeout method (there doesn't seem to be any).
* kbr/nemesis-abort-source:
test: raft: randomized_nemesis_test: on timeout, abort calls instead of discarding them
raft: server: translate semaphore_aborted to request_aborted
test: raft: logical_timer: add abortable version of `sleep_until`
test: raft: randomized_nemesis_test: collect statistics on successful and failed ops
Only users are internal and tests.
Tests: unit(dev)
* replica-table-remove-make-reader-v1/v2 of github.com/denesb/scylla.git
replica/table: remove v1 reader factory methods
tests: move away from table::make_reader()
replica/table: add short make_reader_v2() variant:
This is the last place that still uses gossip_snitch_info(). It
can be reworked to use the get_app_states(), then the former
helper can be removed.
Another motivation for this is to stop using the _gossiper_started
boolean from the base class. This, in turn, will allow to remove
the whole gossiper_starting() notification altogether.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In its .start() helper the property-file driver does everything but
registers the reconnectable helper (like the ec2 m.r. one from the
previous patch did). Similarly to ec2 m.r. snitch this one can also
register its helper in .start(), before gossiper_starting() is called.
One thing to care about in this driver is that some tests start this
snitch without starting gossiper, thus an extra protection against
not initialized gossiper is needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This driver registers reconnectable helper in its gossiper_starting()
callback. It can be done earlier -- in the snitch .start() one, as
gossiper doesn't notify listeners until it's started for real (even
its shadow round doesn't kick them).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays snitch states are put into gossiper via .gossiper_starting()
call by gossiper. This, in turn, happens in two places -- on node
ring join code and on re-enabling gossiper via the API call.
The former can be performed by the ring joining code with the help of
recently introduced snitch.get_app_states() helper.
The latter call is in fact not needed. Re-gossiped are DC, RACK and
for some drivers the INTERNAL_IP states that don't change throughout
snitch lifetime and are preserved in the gossiper pre-loaded states.
Thus, once the snitch states are applied by storage service ring join
code, the respective states update can be removed from the snitch
gossiper_starting() implementations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This virtual method returns back the list of app states that snitch
drivers need to gossip around. The exact implementation copies the
gossip_snitch_info() logic of the respective drivers and is unused.
Next patches will make use of it (spoiler: the latter method will be
removed after that).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The driver in question wants to execute some of its actions on shard 0
and it calls smp::invoke(0, ...) for this. The invoked lambda thus needs
to refer to global snitch instance.
There's a nicer and shorter way of re-sharding for snitch drivers -- the
sharded<snitch_ptr>* _my_distributed field on the base class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just have the local snitch reference in
the ring joining code. This simplifies next patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In test_tracing.py and util.py, we already have three duplicates of code
which looks for the Scylla REST API. We'll soon want to add even more uses
of this REST API, so it's a good time to add a single fixture, "rest_api",
which can be used in all tests that need the Scylla REST API instead of
duplicating the same code.
A test using the "rest_api" fixture will be skipped if the server isn't
Scylla, or its port 10000 is not available or not responsive.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220331195337.64352-1-nyh@scylladb.com>
Currently the main help is a big wall of text. This makes it hard to
quickly jump to the section of interest. This patch reorganizes it into
clear sections, each with a title. Sections are now also ordered
according to the part they reference in the command-line.
This should make it easier for answers to questions regarding a certain
topic to be quickly found, without having to read a lot of text.
There are at least 1 actual and 1 potential users for it; this
change converts the existing one.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
"
First migrate all users to the v2 variant, all of which are tests.
However, to be able to properly migrate all tests off it, a v2 variant
of the restricted reader is also needed. All restricted reader users are
then migrated to the freshly introduced v2 variant and the v1 variant is
removed.
Users include:
* replica::table::make_reader_v2()
* streaming_virtual_table::as_mutation_source()
* sstables::make_reader()
* tests
This allows us to get rid of a bunch of conversions on the query path,
which was mostly v2 already.
With a few tests we did kick the can down the road by wrapping the v2
reader in `downgrade_to_v1()`, but this series is long enough already.
Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug)
"
* 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla:
readers: remove now unused v1 reader from mutations
test: move away from v1 reader from mutations
test/boost/mutation_reader_test: use fragment_scatterer
test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh
test/boost: mutation_fragment_test: refactor fragment_scatterer
readers: remove now unused v1 reversing reader
test/boost/flat_mutation_reader_test: convert to v2
frozen_mutation: fragment_and_freeze(): convert to v2
frozen_mutation: coroutinize fragment_and_freeze()
readers: migrate away from v1 reversing reader
db/virtual_table: use v2 variant of reversing and forwardable readers
replica/table: use v2 variant of reversing reader
sstables/sstable: remove unused make_crawling_reader_v1()
sstables/sstable: remove make_reader_v1()
readers: add v2 variant of reversing reader
readers/reversing: remove FIXME
readers: reader from mutations: use mutation's own schema when slicing
This patch adds importing the `malloc` and `free` method from the wasm client, and using them for allocating wasm memory for UDF arguments and freeing its result. When the methods are not exported, the old behaviour is used instead. To make that possible, this patch also includes a fix to the usage of pages in wasm memory (methods `size` and `grow`) that were used for allocating memory for arguments until now. (The source codes for the examples didn't work on my machine in their original form, so when updating paging I've also added small unrelated modifications)
Tests: unit(dev)
Closes#10234
* github.com:scylladb/scylla:
wasm: add wasm ABI version 2
wasm: add WASI handling
wasm: add documentation
wasm: add _scylla_abi export for specifying abi for wasm udfs
wasm: update ABI for passing parameters to wasm UDFs
wasm: move common code to a separate function
wasm: use wasm pages for wasm memory
As the name suggests, for UDFs defined as RETURNS NULL ON NULL
INPUT, we sometimes want to return nulls. However, currently
we do not return nulls. Instead, we fail on the null check in
init_arg_visitor. Fix by adding null handling before passing
arguments, same as in lua.
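The intended behaviour can be modelled with a small Python wrapper. This is
only an illustration of the null handling; the decorator name is
hypothetical and the real fix lives in the C++ wasm argument passing code:

```python
import functools


def returns_null_on_null_input(udf):
    """Skip calling the UDF and return null if any argument is null."""
    @functools.wraps(udf)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None          # handled before arguments reach the UDF
        return udf(*args)
    return wrapper


@returns_null_on_null_input
def add(a, b):
    return a + b
```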
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#10298
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.
Fixes #10300
Closes #10302
We have a test for the LIKE restriction with ALLOW FILTERING.
Cassandra does not yet support this combination (it only supports LIKE
with SASI indexes), so this test fails on Cassandra, suggesting either
the test is wrong, or Cassandra is wrong. In this case, Cassandra is
wrong - they have an issue requesting this to be fixed -
https://issues.apache.org/jira/browse/CASSANDRA-17198, and even an
implementation which is being reviewed.
So let's mark this test with "cassandra_bug", meaning it is expected
to fail (xfail) when running against Cassandra. When CASSANDRA-17198
is fixed, we can remove the cassandra_bug mark.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220330211734.4103691-1-nyh@scylladb.com>
Instead of taking an output parameter in the constructor, take just the
desired number of mutations to build and return the mutation list from
`consume_end_of_stream()`.
When protocol servers are stopping they wait for all active queries to
complete, but DDL queries use migration manager internally, so if they
hang there protocol servers will not be able to stop since migration
manager is drained afterwards. The patch moves the migration manager
draining before protocol servers stoppage.
Since after the patch the migration manager is drained before messaging
service is stopped we need to make sure that no rpc request triggers new
migration manager requests. We do it by making sure that any attempt to
issue such a request after aborted will return abort_requested_exception.
We want to be able to abort raft operations on migration manager drain.
MM already has an abort source that is signaled on drain, so all that is
left is to pass it to raft calls.
Silence request_aborted read error since it is expected to happen during
shutdown and report remote rpc errors as warnings instead of errors since
if they are indeed severe they should be handled by the rpc client, but
OTOH some non-critical errors are expected to happen during shutdown.
The only internal user is the v1 make reader from mutations, we use a
downgrade/upgrade to be able to use the v2 reversing reader there. This
is ugly but the v1 reader from mutations is going away soon too, so not
a real problem.
No external users, only used internally, by make_reader(), who delegates
cases currently unsupported by v2 to it. The code needed from
make_reader_v1() is inlined into make_reader() and the former is
removed.
The v2 format allows for a much simpler reversing mechanism since
clustering fragments can simply be reversed as they are read. Fragments
are directly pushed in the reader's buffer eliminating a separate move
phase.
Existing reverse reader unit tests are converted to test the v2 one.
Instead of the schema that is used for the reader. The schema of
individual mutations might be different (albeit compatible) and in debug
mode this can trigger an assert in mutation partition.
A user pointed out a misleading error message produced when
an indexed column is queried along with an IN relation
on the partition key. The message suggests that such queries are
not supported, but they are supported - just without indexing.
In particular, with ALLOW FILTERING, such queries are perfectly
fine.
Closes#10299
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).
Fixes #10286
Closes #10294
If exception is caught while updating backlog tracker, the backlog
tracker will be disabled for the underlying table, potentially
causing compaction to fall behind.
That being said, let's raise the log level to error, to give it
its due importance and allow tests to detect the problem.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220330151421.49054-1-raphaelsc@scylladb.com>
Because the only available version of wasm ABI did not allow
freeing any allocated memory, a new version of the ABI is
introduced. In this version, the host is required to export
_scylla_malloc and _scylla_free methods, which are later used
for the memory management.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
One of the issues that comes with compiling programs to WebAssembly
is the lack of a default implementation of a memory allocator. As
a result, the only available solutions to the need of memory allocation
are growing the wasm memory for each new allocated segment, or
implementing one's own memory allocator. To avoid both of these
approaches, for many languages, the user may compile a program to
a WASI target. By doing so, the compiler adds default implementations
of malloc and free methods, and the user can use them for dynamic
memory management.
This patch enables executing programs compiled with WASI by enabling
it in the wasmtime runtime.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The ABI of wasm UDFs changed since the last time the documentation
was written, so it's being updated in this patch.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The method checks that bootstrap state is equal to
`NEEDS_BOOTSTRAP`. This will be used later to check
if we are in the state of "fresh" start (i.e. starting
a node from scratch).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The `bootstrap_complete()`, `bootstrap_in_progress()`,
`was_decommissioned()` and `get_bootstrap_state()` don't
modify internal state, so eligible to be marked as `const`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Different languages may require different ABIs for passing
parameters, etc. This patch adds a requirement for all wasm
UDFs to export a _scylla_abi symbol, which is a 32-bit integer
with a value specifying the ABI version.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
WebAssembly uses 32-bit address space, while also
having 64-bit integers as its native types. As a result,
when passing size of an object in memory and its address,
it can be combined into one 64-bit value. As a bonus,
if the object is null, we can signal it by passing -1 as
its size.
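The packing can be sketched in Python. The bit layout used here (size in
the upper 32 bits, address in the lower 32 bits) is an assumption made for
the example, as are the function names; consult the wasm documentation for
the authoritative layout:

```python
MASK32 = 0xFFFFFFFF


def pack(addr, size):
    """Combine a 32-bit address and a 32-bit size into one 64-bit value."""
    return ((size & MASK32) << 32) | (addr & MASK32)


def pack_null():
    # A null object is signalled by a size of -1 (0xFFFFFFFF as unsigned).
    return pack(0, -1)


def unpack(value):
    """Return (addr, size), or None if the value encodes a null object."""
    size = (value >> 32) & MASK32
    addr = value & MASK32
    if size == MASK32:
        return None
    return addr, size
```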
This patch implements handling of this new ABI and adjusts
examples in test_wasm.py.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Both init_nullable_arg_visitor and, in case
of abstract_type, init_arg_visitor were
the same method with one difference. The
common part was moved to init_abstract_arg,
and the difference remained in the operator()
method.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The memory.grow and memory.size wasm methods return
the memory size in pages, and memory.grow takes its
argument in the number of pages. A WebAssembly page
has a size of 64KiB, so during memory allocation
we have to divide our desired size in bytes by page
size and round up. Similarly, when reading memory
size we need to multiply the result by 64KiB to
get the size in bytes.
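The conversions described above amount to the following arithmetic, shown
here in Python with hypothetical helper names:

```python
WASM_PAGE_SIZE = 64 * 1024  # 64KiB per WebAssembly page


def pages_for(nbytes):
    """Pages needed to hold nbytes: divide by page size and round up."""
    return (nbytes + WASM_PAGE_SIZE - 1) // WASM_PAGE_SIZE


def bytes_in(pages):
    """Size in bytes of a memory that is `pages` pages long."""
    return pages * WASM_PAGE_SIZE
```

For example, a 65537-byte allocation needs two pages, since it is one byte
over a single page.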
The change affects current naive allocator for
arguments when calling wasm UDFs and the examples
in wasm_test.py - both commented code and compiled
wasm in text representation.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
"
Following up on the recent split of flat_mutation_reader.hh and friends,
this series applies the same treatment to mutation_reader.hh. Each
reader gets its own header, while definitions are moved into
readers gets its own header, while definitions are moved into
readers/mutation_readers.cc. There are two exceptions to this: the
combined and multishard reader families each make up more than 1K SLOC,
so these get their own source file, to avoid a SLOC explosion in
mutation_readers.cc.
This series is almost completely mechanical, moving code and patching
inclusion sites.
Tests: unit(dev)
"
* 'mutation-reader-hh-split/v1' of https://github.com/denesb/scylla:
readers: merge fmr_logger and mrlog
tree: remove now empty mutation_reader.{hh,cc}
tree: remove mutation_reader.hh include
mutation_reader: move mrlog (mutation reader logger) to readers/
mutation_reader: move compacting reader into readers/
mutation_reader: move queue reader to readers/
mutation_reader: move mutation source into readers/
mutation_reader: move slicing filtering reader into readers/
mutation_reader: move filtering reader into readers/
readers: move multishard reader & friends to reader/multishard.cc
mutation_reader: remove unused remote_fill_buffer_result
readers: move combined reader into readers/
By folding the former to the latter. Now that all the readers are nicely
co-located in the same folder, there is no point in having two distinct
loggers for them.
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but it's probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
"
Quoting patch 3/4:
"This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used, which cleans up one file at a
time, starting from the smallest.
The cleanup strategy for TWCS is simple. It's simply calling the
size tiered cleanup method for each bucket, so there will be
one job for each tier in each window.
The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).
Refs #10097."
** Simply put, the goal is to reduce writeamp when performing cleanup
on a TWCS table, therefore reducing the operation time. **
tests: unit(dev).
"
* 'twcs_cleanup_bucket_aware/v1' of https://github.com/raphaelsc/scylla:
tests: sstable_compaction_test: Add test for TWCS' bucket-aware cleanup
compaction: TWCS: Implement cleanup method for bucket awareness
compaction: TWCS: change get_buckets() signature to work with const qualified functions
compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value
With v2 having individual bounds of range tombstone as separate
fragments, out-of-order fragments become more difficult to handle,
especially in the presence of active range tombstone.
Scrub in both SKIP and SEGREGATE mode closes the partition on
seeing the first invalid fragment (SEGREGATE re-opens it immediately).
If there is an active range tombstone, scrub now also has to take care
of closing said tombstone when closing the partition. In a normal stream
it could just use the last position-in-partition to create a closing
bound. But when out-of-order fragments are on the table this is not
possible: the closing bound may be found later in the stream, with a
position smaller than that of the current position-in-partition.
To prevent extending range tombstone changes like that, Scrub now aborts
the compaction on the first invalid fragment seen *inside* an active
range tombstone.
Fixing a v2 stream with range tombstone changes is definitely possible,
but non-trivial, so we defer it until there is demand for it.
This series also makes the mutation fragment stream validator check for
open range tombstones on partition-end and adds a comprehensive
test-suite for the validator.
Fixes: #10168
Tests: unit(dev)
* scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git:
compaction/compaction: abort scrub when attempting to rectify stream with active tombstone
test/boost/mutation_test: add test for mutation_fragment_stream_validator
mutation_fragment_stream_validator: validate range tombstone changes
When a Raft API call such as `add_entry`, `set_configuration` or
`modify_config` takes too long, we need to time-out. There was no way to
abort these calls previously so we would do that by discarding the futures.
Recently the APIs were extended with `abort_source` parameters. Use this.
Also improve debuggability if the functions throw an exception type that
we don't expect. Previously if they did, a cryptic assert would fail
somewhere deep in the generator code, making the problem hard to debug.
This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used, which cleans up one file at a
time, starting from the smallest.
The cleanup strategy for TWCS is simple. It's simply calling the
size tiered cleanup method for each bucket, so there will be
one job for each tier in each window.
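A rough sketch of the idea, not the actual implementation. The sstable
representation (dicts with "window" and "size" keys) and the toy tiering
rule in size_tiered_jobs() are hypothetical stand-ins; the only part taken
from the text is that the size-tiered method is called once per window.

```python
from collections import defaultdict


def size_tiered_jobs(sstables):
    """Toy stand-in for the size-tiered cleanup method: sstables whose
    sizes have the same number of digits share a tier (hypothetical rule)."""
    tiers = defaultdict(list)
    for sst in sstables:
        tiers[len(str(sst["size"]))].append(sst)
    return [tiers[key] for key in sorted(tiers)]


def twcs_cleanup_jobs(sstables):
    """Return one job per size tier in each time window."""
    windows = defaultdict(list)
    for sst in sstables:
        windows[sst["window"]].append(sst)
    jobs = []
    for window in sorted(windows):
        # Delegate to the size-tiered method, one bucket (window) at a time.
        jobs.extend(size_tiered_jobs(windows[window]))
    return jobs
```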
The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).
Refs #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Then the caller can decide whether to copy or move the candidate set into the
function. cleanup_sstables_compaction_task can move the candidates as
they are no longer needed once it retrieves all descriptors.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, we just pass the entire output of perftune.py when getting the CPU
mask from the script, but this may cause a parse error since the script may
also print warning messages.
To avoid that, we need to extract CPU mask from the output.
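A sketch of such an extraction, assuming the mask is printed on its own
line as one or more comma-separated hex words (for example 0x0000000f).
The exact output format and the function name are assumptions, not taken
from the patch.

```python
import re

# Assumption: a mask line consists only of comma-separated hex words.
MASK_RE = re.compile(r"^0x[0-9a-fA-F]+(?:,0x[0-9a-fA-F]+)*$")


def extract_cpu_mask(output: str) -> str:
    """Return the first line of `output` that looks like a CPU mask."""
    for line in output.splitlines():
        candidate = line.strip()
        if MASK_RE.match(candidate):
            return candidate
    raise ValueError("no CPU mask found in script output")
```

Warning lines are skipped because they do not match the mask pattern, so
only the mask itself reaches the parser.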
Fixes #10082
Closes #10107
This patch adds a reproducer for the JSON encoding in issue #9061.
The bug was already fixed (it was a Seastar bug, and Seastar was
updated in commit 5d4213e1b8), but
I verified that the test fails before that patch - and passes today.
It is useful to have such a test for regressions, as well as for
testing backports.
Unfortunately, the test isn't pretty. The test uses the toppartitions
API, which instead of having a "start" and "stop" request has a single
synchronous "start for a given duration" request, and we need to run
it with some fixed duration (we took 1 second), and in parallel, one
request.
Refs #9061.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220323180855.3307931-1-nyh@scylladb.com>
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot
expect the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.
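The pattern can be illustrated with a small Python analogy. The real code
is C++ on top of Seastar, and the Service class here is purely
hypothetical; the point is only that start_server() undoes its own partial
work before re-raising.

```python
class Service:
    """Hypothetical stand-in for a resource such as a sharded service."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.running = False

    def start(self):
        if self.fail:
            raise RuntimeError(f"failed to start {self.name}")
        self.running = True

    def stop(self):
        self.running = False


def start_server(services):
    """Start all services; on failure, stop the ones already started."""
    started = []
    try:
        for svc in services:
            svc.start()
            started.append(svc)
    except Exception:
        # The caller will not call stop_server(), so clean up here,
        # in reverse order of creation, and then report the error.
        for svc in reversed(started):
            svc.stop()
        raise
    return started
```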
This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
will result in the unintelligible:
<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.
After this patch we get the right error message:
ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])
Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.
Fixes #10025
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
In commit 964500e47a, in the middle of
a larger series, I fixed a small Alternator bug that I found while working
on that series. The bug was that the ReturnValues=ALL_NEW feature moved out
the read previous_item, which breaks operations that need previous_item,
e.g., an ADD operation. Unfortunately, we never had a regression test for
this fixed bug, so in this patch I add one.
This bug was re-discovered on an old branch by a user, at which point
I noticed that we don't have a test for it - so I want to add it now,
even though the bug itself is long gone from Scylla master.
I verified that the new test indeed fails on old versions of Scylla
before the aforementioned commit, and passes when backporting only that
commit.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327074928.3608576-1-nyh@scylladb.com>
"
There's a static global sharded<local_cache> variable in system keyspace
that keeps several bits on board that other subsystems need to get from
the system keyspace, but want to have in a future<>-less manner.
Some time ago the system_keyspace became a classical sharded<> service
that references the qctx and the local cache. This set removes the global
cache variable and makes its instances be unique_ptr's sitting on the
system keyspace instances.
The biggest obstacle on this route is the local_host_id that was cached,
but at some point was copied onto db::config to simplify getting the value
from sstables manager (there's no system keyspace at hand there at all).
So the first thing this set does is remove the cached host_id and make
all the users get it from the db::config.
(There's a BUG with config copy of host id -- replace node doesn't
update it. This set also fixes this place)
De-globalizing the cache is the prerequisite for untangling the snitch-
-messaging-gossiper-system_keyspace knot. Currently cache is initialized
too late -- when main calls system_keyspace.start() on all shards -- but
before this time messaging should already have access to it to store
its preferred IP mappings.
tests: unit(dev), dtest.simple_boot_shutdown(dev)
"
* 'br-trade-local-hostid-for-global-cache' of https://github.com/xemul/scylla:
system_keyspace: Make set_local_host_id non-static
system_keyspace: Make load_local_host_id non-static
system_keyspace: Remove global cache instance
system_keyspace: Make it peering service
system_keyspace,snitch: Make load_dc_rack_info non-static
system_keyspace,cdc,storage_service: Make bootstrap manipulations non-static
system_keyspace: Coroutinize set_bootstrap_state
gossiper: Add system keyspace dependency
cdc_generation_service: Add system keyspace dependency
system_keyspace: Remove local host id from local cache
storage_service: Update config.host_id on replace
storage_service: Indentation fix after previous patch
storage_service: Coroutinize prepare_replacement_info()
system_distributed_keyspace: Indentation fix after previous patch
code,system_keyspace: Relax system_keyspace::load_local_host_id() usage
code,system_keyspace: Remove system_keyspace::get_local_host_id()
"
By way of having an implementation of `data_dictionary` and using that.
The schema loader only needs a database to parse cql3 statements, which
are all coordinator-side objects and hence been largely migrated to use
data dictionary instead.
A few hard-dependencies on replica:: objects were found and resolved:
* index::secondary_index_manager
* tombstone_gc
The former was migrated to use `data_dictionary::table` instead of
`replica::table`. This in turn requires disentangling
`replica::data_dictionary_impl` from `replica::database`, as currently
the former can only really be used by the latter.
What all of this achieves is that we no longer have to instantiate a
`replica::database` object in `tools::load_schema()`. We want to use the
standard allocator in tools, which means they cannot use LSA memory at
all. Database on the other hand creates memtable and row-cache instances
so it had to go.
Refs: #9882
Tests: unit(dev, schema_loader_test:debug,
cql-pytest/test_tools.py:debug)
"
* 'tools-schema-loader-database-impl/v2' of https://github.com/denesb/scylla:
tools/schema_loader: use own data dictionary impl
tombstone_gc: switch to using data dictionary
index/secondary_index_manager: switch to using data dictionary
replica/table: add as_data_dictionary()
replica: disentangle data_dictionary_impl from database
replica: move data_dictionary_impl into own header
In the DynamoDB API, error responses are in JSON format with specific
fields ("__type" and "message" in the x-amz-json-1.0 format currently
used). Alternator tried to be clever and build the string representation
of this JSON itself, instead of using RapidJSON. But this optimization
was a mistake - if the error message contains characters that need
escaping (such as double quotes and newlines), they weren't escaped,
and the resulting JSON was malformed. When the client library boto3
read this malformed JSON it got confused, considered the entire error
response to be a string, which resulted in an ugly error message.
The fix is easy - just build the JSON output as usual with RapidJSON
instead of trying to optimize using string operation.
The patch also includes two tests reproducing this bug and checking its
fix. The first test uses boto3 and shows it got confused on the type
of error (not understanding that it is a ValidationException). The
second test bypasses boto3 and shows exactly where the bug happens -
the response is an unparsable JSON.
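The difference between the two approaches can be shown with a short Python
sketch. Alternator itself is C++ and uses RapidJSON; the function names and
the simplified "__type" value below are for illustration only.

```python
import json


def naive_error(type_name: str, message: str) -> str:
    """Hand-built JSON string with no escaping (the buggy approach)."""
    return '{"__type":"%s","message":"%s"}' % (type_name, message)


def safe_error(type_name: str, message: str) -> str:
    """Let a JSON library build the output, so escaping is handled."""
    return json.dumps({"__type": type_name, "message": message})
```

A message containing a double quote or a newline makes naive_error()
produce a string that json.loads() rejects, while the output of
safe_error() parses back to the original message.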
Fixes #10278
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327132705.3707979-1-nyh@scylladb.com>
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just the default configuration of the
scylla-server .deb package, with --log-to-stdout set to 0, same as in a normal
installation. We don't want to keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.
Fixes #10270
Closes #10280
This reverts commit 37dc31c429. There is no
reason to suppose compacting different tables concurrently on different shards
reduces space requirements, apart from non-deterministically pausing
random shards.
However, when data is badly distributed and there are many tables, it will
slow down major compaction considerably. Consider a case where there are
100 tables, each with a 2GB large partition on some shard. This extra
200GB will be compacted on just one shard. With a compaction rate of 40 MB/s,
this adds more than an hour to the process. With the existing code, these
compactions would overlap if the badly distributed data was not all in one
shard.
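The estimate above can be checked with a few lines of arithmetic. This
assumes 1 GB = 1024 MB; the function name is made up for the example.

```python
def extra_compaction_minutes(tables: int, partition_gb: float,
                             rate_mb_per_s: float) -> float:
    """Minutes needed to compact the extra data serially on one shard."""
    total_mb = tables * partition_gb * 1024  # assumes 1 GB = 1024 MB
    return total_mb / rate_mb_per_s / 60


# 100 tables, a 2GB partition each, compacted at 40 MB/s.
minutes = extra_compaction_minutes(100, 2, 40)
```

This comes to 5120 seconds, or roughly 85 minutes, which matches the "more
than an hour" figure.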
It is also counter to tablets, where data is not equally distributed on
purpose.
Closes #10246
"
Cleanup compaction works by rewriting all sstables that need clean up, one at
a time.
This approach can cause bad write amplification because the output data is
being made incrementally available for regular compaction.
Cleanup is a long operation on large data sets, and while it's happening,
new data can be written to buckets, triggering regular compaction.
Cleanup fighting for resources with regular compaction is a known problem.
With cleanup adding one file at a time to buckets, regular may require multiple
rounds to compact the data in a given bucket B, producing bad writeamp.
To fix this problem, cleanup will be made bucket aware. As each compaction
strategy has its own definition of bucket, strategies will implement their
own method to retrieve cleanup jobs. The method will be implemented such that
all files in a bucket B will be cleaned up together, and on completion,
they'll be made available for regular at once.
For STCS / ICS, a bucket is a size tier.
For TWCS, a bucket is a window.
For LCS, a bucket is a level.
In this way, writeamp problem is fixed as regular won't have to perform
multiple rounds to compact the data in a given bucket. Additionally, cleanup
will now be able to deduplicate data and will become way more efficient at
garbage collecting expired data.
The space requirement shouldn't be an issue, as compacting an entire bucket
happens during regular compaction anyway.
With leveled strategy, compacting an entire level is also not a problem because
files in a level L don't overlap and therefore incremental compaction is
employed to limit the space requirement.
For the time being, only STCS cleanup was made bucket aware. The others will be
using a default method, where one file is cleaned up at a time. Making cleanup
of other strategies bucket aware is relatively easy now and will be done soon.
Refs #10097.
"
* 'cleanup-compaction-revamp/v3' of https://github.com/raphaelsc/scylla:
test: sstable_compaction_test: Add test for strategy cleanup method
compaction: STCS: Implement cleanup strategy
compaction_manager: Wire cleanup task into the strategy cleanup method
compaction_strategy: Allow strategies to define their own cleanup strategy
compaction: Introduce compaction_descriptor::sstables_size
compaction: Move decision of garbage collection from strategy to task type
This implements cleanup strategy for STCS. It will return one descriptor
for each size tier. If a given tier has more than max_threshold
elements, more than 1 job will be returned for that tier. Token
contiguity is preserved by sorting elements of a tier by token.
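A simplified sketch of the job generation described above. The SSTable
type, its first_token field and the function name are hypothetical; the
tiers are assumed to be computed already. Only the rules (one job per tier,
split at max_threshold, sorted by token) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    first_token: int  # hypothetical field used for ordering


def stcs_cleanup_jobs(tiers, max_threshold):
    """Return cleanup jobs: each tier sorted by token, then split into
    chunks of at most max_threshold sstables."""
    jobs = []
    for tier in tiers:
        ordered = sorted(tier, key=lambda sst: sst.first_token)
        for i in range(0, len(ordered), max_threshold):
            jobs.append(ordered[i:i + max_threshold])
    return jobs
```

Sorting before splitting means each job covers a contiguous token range
within its tier.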
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As the cleanup process can now be driven by the compaction strategy,
let's move cleanup into a new task type that uses the new
compaction_strategy::get_cleanup_compaction_jobs().
For the time being, all strategies are using the default method that
returns one descriptor for each sstable that needs clean up.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
And pass it to the cql3 layer when parsing statements. This allows the
schema loader to cut itself from replica::database, using a local, much
simpler database implementation. This not only makes the code much
simpler but also opens up the way to using the standard allocator in
tools. The real database uses LSA which is incompatible with the
standard allocator (in release builds that is).
The callers are system_keyspace.load_local_host_id and storage service.
The former is non-static since previous patch, the latter has its own
sys.ks. reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No users of this variable left, all the code relies on system_keyspace
"this" to get it. Respectively, the cache can be a unique_ptr<> on the
system_keyspace instance and the global sharded variable can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And remove a bunch of (_local)?_cache.invoke_on_all() calls. This
is the preparation for removing the global cache instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's snitch code that needs it. It now takes messaging service
from gossiper, so it can do the same with system keyspace. This
change removes one user of the global sys.ks. cache instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The users of get_/set_bootstrap_state and aux helpers are CDC and
storage service. Both have local system_keyspace references and can
just use them. This removes some users of global system ks. cache
and the qctx thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The gossiper reads peer features from system keyspace. Also the snitch
code needs system keyspace, and since now it gets all its dependencies
from gossiper (will be fixed some day, but not now), it will do the same
for sys.ks.. Thus it's worth having gossiper->system_keyspace explicit
dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
But only on the surface, the only internal function needing the database
(`needs_repair_before_gc()`) still gets a real database because the
replication factor cannot be obtained from the data dictionary
currently. Although this might not look like an improvement, it is
enough to avoid a `real_database()` call for tables that don't have
tombstone gc mode set to repair.
The service uses system keyspace to, e.g., manage the generation id,
thus it depends on the system_keyspace instance and deserves the
explicit reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The config.host_id value is loaded early on start, but when the
storage service prepares to join the cluster to replace a node,
it will change that value (with the host id of the target). This
change only affects the system keyspace, but not the config copy,
which is a BUG.
fixes: #10243
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is nowadays called from several places:
- API
- sys.dist.ks. (to update view building info)
- storage service prepare_to_join()
- set up in main
They all, but the last, can use db::config cached value, because
it's loaded earlier than any of them (but the last -- that's the
loading part itself).
Once patched, the load_local_host_id() can avoid checking the cache
for that value -- it will not be there for sure.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The host id is cached on db::config object that's available in
all the places that need it. This allows removing the method in
question from the system_keyspace and not caring that anyone that
needs host_id would have to depend on system_keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it a standalone class, instead of private subclass of database.
Unfriend database and instead make wrap/unwrap methods public, so anyone
can use them.
The test runners cql-pytest/run et al. try to automatically find the
last-compiled Scylla executable, but this decision can be overridden by
the SCYLLA environment variable. If the user mistakenly sets SCYLLA to
something which is not a valid path of an executable, the result was a
long and obscure Python stack trace.
So after this patch, if SCYLLA points to something which is not an
executable, a clear error is produced immediately, directing the user
to set this variable to a correct executable.
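A check along these lines could look like the sketch below. The function
name and the error wording are assumptions; the actual code in the test
runners may differ.

```python
import os


def resolve_scylla(env=os.environ):
    """Return the executable named by SCYLLA, or None if it is unset."""
    path = env.get("SCYLLA")
    if path is None:
        return None  # fall back to automatic detection
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        # Fail early with a clear message instead of a stack trace later.
        raise SystemExit(
            f"SCYLLA='{path}' is not an executable; "
            "set SCYLLA to the path of a Scylla executable"
        )
    return path
```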
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220323164427.3301828-1-nyh@scylladb.com>
Today, all compaction strategies will clean up their files using the
incremental approach of one sstable being rewritten at a time.
Turns out that's not the best approach performance wise. Let's take
STCS for example. As cleanup finishes rewriting one file, the output
file is placed into the sstable set. Regular now can compact that
file with another that was already there (e.g. produced by flush after
cleanup started). Inefficient compactions like this can keep happening
as cleanup incrementally places output file into the candidate list
for regular.
This method will allow strategies to clean up their files in batches.
For example, STCS can clean up all files in smallest tiers in single
round, allowing the output data to be added at once. So next compaction
rounds can be more efficient in terms of writeamp. Another benefit is
that deduplication and GC can happen more efficiently.
The drawback is the space requirement, as we no longer compact one file
at a time. However, the impact is minimized by cleaning up the smallest
tier first. With leveled strategy for example, even though 90% of data
is in highest level, the space requirement is not a problem because
we can apply the incremental compaction on its behalf. The same applies
to ICS. With STCS, the requirement is the size of the tier being
compacted, but that's already expected by its users anyway.
For the time being, all strategies have it unimplemented, so they still
use the old behavior where files are rewritten one at a time.
This will allow us to incrementally implement the cleanup method for
all compaction strategies.
Refs #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Based on perf_simple_query, just bashes data into CL using
normal distribution min/max data chunk size, allowing direct
freeing of segments, _but_ delayed by a normal dist as well,
to "simulate" secondary delay in data persistence.
Needs more stuff.
Some baseline measurements on master:
--min-flush-delay-in-ms 10 --max-flush-delay-in-ms 200
--commitlog-use-hard-size-limit true
--commitlog-total-space-in-mb 10000 --min-data-size 160 --max-data-size 1024
--smp1
median 2065648.59 tps ( 1.1 allocs/op, 0.0 tasks/op, 1482 insns/op)
median absolute deviation: 48752.44
maximum: 2161987.06
minimum: 1984267.90
--min-data-size 256 --max-data-size 16384
median 269385.25 tps ( 2.2 allocs/op, 0.7 tasks/op, 3244 insns/op)
median absolute deviation: 15719.13
maximum: 323574.43
minimum: 228206.28
--min-data-size 4096 --max-data-size 61440
median 67734.22 tps ( 6.4 allocs/op, 2.9 tasks/op, 9153 insns/op)
median absolute deviation: 2070.93
maximum: 82833.17
minimum: 61473.57
--min-data-size 61440 --max-data-size 1843200
median 2281.37 tps ( 79.7 allocs/op, 43.5 tasks/op, 202963 insns/op)
median absolute deviation: 128.87
maximum: 3143.84
minimum: 2140.80
--min-data-size 368640 --max-data-size 6144000
median 679.76 tps (225.5 allocs/op, 116.3 tasks/op, 662700 insns/op)
median absolute deviation: 39.30
maximum: 1148.95
minimum: 586.86
Actual throughput obviously meaningless, as it is run on my slow
machine, but IPS might be relevant.
Note that transaction throughput plummets as we increase median data
sizes above ~200k, since we then more or less always end up replacing
buffers in every call.
Closes #10230
There's a script to automate fetching submodule changes. However, this
script always fetches the remote master branch, which is not always the right one.
For example, for branch-5.0/next-5.0 pair the correct scylla-seastar
branch would be the branch-5.0 one, not master.
With this change updating a submodule from a custom branch would be like
refresh-submodules.sh <submodule>:<branch>
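The argument format can be illustrated with a small Python sketch. The
actual script is shell; the function name, the "seastar" submodule name in
the usage note and the fallback to master when no branch is given are
assumptions for illustration.

```python
def parse_submodule_spec(spec: str, default_branch: str = "master"):
    """Split '<submodule>:<branch>' into (submodule, branch)."""
    name, sep, branch = spec.partition(":")
    if not name:
        raise ValueError("submodule name is empty")
    # Without an explicit branch, fall back to the default one.
    return name, (branch if sep and branch else default_branch)
```

For example, "seastar:branch-5.0" yields ("seastar", "branch-5.0"), while a
bare "seastar" yields ("seastar", "master").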
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220322093623.15748-1-xemul@scylladb.com>
This reverts commit 87df37792c.
Scylla branches are not mapped to seastar branches 1-1, so getting
the upstream scylla branch doesn't point to the correct seastar one.
"
The only real user is view building, which is converted to v2 and then
the v1 version of the mutation from fragments reader is removed.
Tests: unit(dev, release)
"
* 'v2-only-from-fragments-mutations/v1' of https://github.com/denesb/scylla:
readers: remove now unused v1 reader from fragments
test/boost: flat_mutation_reader_test: remove reader from fragments test
replica/table: migrate generate_and_propagate_view_updates() to v2
replica/table: migrate populate_views() to v2
db/view: convert view_update_builder interface to v2
db/view: migrate view_update_builder to v2
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.
That's a decision that belongs to the task type. For example, all regular
compactions enable GC, whereas scrub, for example, doesn't for safety
reasons.
The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.
As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.
can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().
The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
On aarch64 the `std::move(mf)` seems to be reordered w.r.t.
`flush_tombstones()` in certain circumstances. These circumstances
are not clear yet, but while further investigation happens, this patch
makes the tests pass on aarch64, unclogging the promotion pipeline.
Refs: #10248
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220321122209.71685-1-bdenes@scylladb.com>
In CQL, table names are limited to so-called word characters (letters,
numbers and underscores), but column names don't have such a limitation.
When we create a secondary index, its default name is constructed from
the column name - so can contain problematic characters. It can include
even the "/" character. The problem is that the index name is then used,
like a table name, to create a directory with that name.
The test included in this patch demonstrates that before this patch, this
can be misused to create subdirectories anywhere in the filesystem, or to
crash Scylla when it fails to create a directory (which it considers an
unrecoverable I/O error).
In this patch we do what Cassandra does - remove all non-word
characters from the indexed column name before constructing the default
index name. In the included test - which can run on both Scylla and
Cassandra - we verify that the constructed index name is the same as
in Cassandra, which is useful to know (e.g., because knowing the index
name is needed to DROP the index).
Also, this patch adds a second line of defense against the security problem
described above: It is now an error to create a schema with a slash or
null (the two characters not allowed in Unix filenames) in the keyspace
or table names. So if the first line of defense (CQL checking the validity
of its commands) fails, we'll have that second line of defense. I verified
that if I revert the default-index-name fix, the second line of defense
kicks in, and the index creation is aborted and cannot create files in
the wrong place to crash Scylla.
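Both lines of defense can be sketched in Python. The
<table>_<column>_idx naming pattern and the function names are assumptions
made for illustration; the text above only states that non-word characters
are removed from the column name and that slash and null are rejected in
keyspace and table names.

```python
import re


def default_index_name(table: str, column: str) -> str:
    """Build a default index name from a column name, keeping only
    word characters (letters, numbers and underscores)."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", column)
    return f"{table}_{cleaned}_idx"  # assumed naming pattern


def validate_schema_name(name: str) -> None:
    """Second line of defense: reject characters that are not allowed
    in Unix filenames."""
    if "/" in name or "\0" in name:
        raise ValueError(f"invalid character in name: {name!r}")
```

A column named "../../etc/x" then contributes only "etcx" to the index
name, so it can no longer point outside the table's directory.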
Fixes #3403
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220320162543.3091121-1-nyh@scylladb.com>
In e1b15ba, we introduced a user-friendly error message for when an exception
occurs while generating perftune.yaml.
However, it became difficult to investigate bugs since we dropped
the traceback.
To resolve this problem, let's print both the traceback and the user-friendly
message.
Related #10050
Closes #10140
Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't
properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
raft feature.
This small series consists of 3 parts to fix the handling of supported
features for raft:
1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the
`raft_group_registry`.
2. Update `system.local#supported_features` directly in the
`feature_service::support()` method.
3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper
value in the support callback within `raft_group_registry`.
* manmanson/track_supported_set_recalculation_v7:
raft: re-advertise gossiper features when raft feature support changes
raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft
gms: feature_service: update `system.local#supported_features` when feature support changes
test: cql_test_env: enable features in a `seastar::thread`
Move the listener from feature service to the `raft_group_registry`.
Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Also, change the signature of `support()` method to return
`future<>` since it's now a coroutine. Adjust existing call sites.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Each feature can have an associated `when_enabled` callback
registered, which is assumed to run in the thread context,
so wrap the `enable()` call in a seastar thread.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Flushing the base table triggers view building
and corresponding compactions on the view tables.
Temporarily disable compaction on both the base
table and all its views before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.
This should make truncate go faster.
In the process, this series also embeds `database::truncate_views`
into `truncate` and coroutinizes both.
Refs #6309
Test: unit(dev)
Closes #10203
* github.com:scylladb/scylla:
replica/database: truncate: fixup indentation
replica/database: truncate: temporarily disable compaction on table and views before flush
replica/database: truncate: coroutinize per-view logic
replica/database: open-code truncate_view in truncate
replica/database: truncate: coroutinize run_with_compaction_disabled lambda
replica/database: coroutinize truncate
compaction_manager: add disable_compaction method
There's a script to automate fetching submodule changes. However, this
script always fetches the remote master branch, which is not always the
right one. The correct branch can be detected by checking the current
remote tracking scylla branch, which should coincide with the submodule one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220317085018.11529-1-xemul@scylladb.com>
"
The generating reader is a reader which converts a functor returning
mutation fragments to a mutation reader.
We currently have 2 generating reader implementations: one operating
with a v1 functor and one with a v2 one. This patch-set converts the v1
functor based one to a v2 reader, by adapting the v1 functor to a v2
functor and reusing the v2 reader implementation.
Tests are also added to both variants.
Tests: unit(dev)
"
* 'generating-reader-v2/v1' of https://github.com/denesb/scylla:
test/boost: mutation_reader_test: add tests for generating reader
test: export squash_mutations() into lib/mutation_source_test.hh
readers: add next partition adaptor
readers: implement generating_reader from v1 generator via adaptor
readers: upgrade_to_v2(): reimplement in terms of upgrading_consumer
readers: add upgrading_consumer
readers: generating_reader: use noncopyable_function<>
readers: merge generating.hh into generating_v2.hh
readers/generating.hh: return v2 reader from make_generating_reader()
"
Making the system-keyspace into a standard sharded instance will
help to fix several dependency knots.
First, the global qctx and local-cache both will be moved onto the
sys-ks, all their users will be patched to depend on system-keyspace.
Now it's not quite so, but we're moving towards this state.
Second, the snitch instance now sits in the middle of another dependency
loop. To untie it, the preferred ip and dc/rack info should be
moved onto system keyspace altogether (now it's scattered over several
places). The sys-ks thus needs to be a sharded service with some
state.
This set makes system-keyspace a sharded instance, equips it with all
the dependencies it needs and passes it as a dependency into storage
service, migration manager and API. This helps eliminate a good
portion of global qctx/cache usage and prepares the ground for snitch
rework.
tests: unit(dev)
v1: unit(debug), dtest.simple_boot_shutdown(dev)
"
* 'br-sharded-system-keyspace-instance-2' of https://github.com/xemul/scylla: (25 commits)
system_keyspace: Make load_host_ids non-static
system_keyspace: Make load_tokens non-static
system_keyspace: Make remove_endpoint and update_tokens non-static
system_keyspace: Coroutinize update_tokens
system_keyspace: Coroutinize remove_endpoint
system_keyspace: Make update_cached_values non-static
system_keyspace: Coroutinuze update_peer_info
system_keyspace: Make update_schema_version non-static
schema_tables: Add sharded<system_keyspace> argument to update_schema_version_and_announce
replica: Push sharded<system_keyspace> down to parse_system_tables
api: Carry sharded<system_keyspace> reference along
storage_service: Keep sharded<system_keyspace> reference
migration_manager: Keep sharded<system_keyspace> reference
system_keyspace: Remove temporary qp variable
system_keyspace: Make get_preferred_ips non-static
system_keyspace: Make cache_truncation_record non-static
system_keyspace: Make check_health non-static
system_keyspace: Make build_bootstrap_info non-static
system_keyspace: Make build_dc_rack_info non-static
system_keyspace: Make setup_version non-static
...
This method used to be a static one in
boost/flat_mutation_reader_test.cc. Turns out it is useful for other
tests based on the mutation source test suite, so move it into the
header of the latter to make it accessible.
Adaptor converts the
`noncopyable_function<future<mutation_fragment_opt>>` to the v2
equivalent, so we can have a single generating reader implementation.
The adaptor uses the upgrading_consumer reusable upgrade component to
implement the actual upgrade.
Upgrading a v1 stream to a v2 one is a common task that currently
requires duplicating the upgrade logic in all components that want to do
this. This patch extracts the upgrade logic from `upgrade_to_v2()` into a
reusable component to promote code reuse.
std::function<> requires the functor it wraps to be copyable, which is
an unnecessarily strict requirement. To relax this, we use
noncopyable_function<> instead. Since the former seems to lack some
disambiguation magic of the latter, we add `_v1` and `_v2` postfixes to
manually disambiguate.
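To illustrate why the copyability requirement matters, here is a minimal sketch of a move-only, type-erased callable wrapper. The name `move_only_fn` is hypothetical and exists only for this illustration; it is not Seastar's `noncopyable_function<>` and makes no claim about how that class is implemented.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical, minimal move-only function wrapper (illustration only).
// Unlike std::function<>, it never copies the wrapped functor, so the
// functor may own move-only state such as a std::unique_ptr.
template <typename Signature>
class move_only_fn;

template <typename R, typename... Args>
class move_only_fn<R(Args...)> {
    struct base {
        virtual ~base() = default;
        virtual R call(Args... args) = 0;
    };
    template <typename F>
    struct impl final : base {
        F f;
        explicit impl(F&& fn) : f(std::move(fn)) {}
        R call(Args... args) override { return f(std::forward<Args>(args)...); }
    };
    std::unique_ptr<base> _impl;
public:
    // The functor is taken by value and moved into heap storage.
    template <typename F>
    move_only_fn(F f) : _impl(std::make_unique<impl<F>>(std::move(f))) {}

    R operator()(Args... args) { return _impl->call(std::forward<Args>(args)...); }
};
```

A lambda that captures a `std::unique_ptr` by move can be stored in such a wrapper, whereas `std::function<>` requires the wrapped functor to be copy-constructible.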
This patch adds the ability to pass an abort_source to raft request APIs
(add_entry, modify_config) to make them abortable. A request issuer does
not always want to wait for a request to complete, for instance because a
client disconnected or because it is no longer interested in waiting
because of a timeout. After this patch it can abort waiting for such
requests through an abort source. Note that aborting a request only
aborts the wait for it to complete; it does not mean that the request
will not eventually be executed.
Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
In https://github.com/scylladb/scylla/issues/10218
we see off-strategy compaction happening on a table
during the initial phases of
`distributed_loader::populate_column_family`.
It is caused by triggering offstrategy compaction
too early, when sstables are populated from the staging
directory in a144d30162.
We need to trigger offstrategy compaction only on the base
table directory, never the staging or quarantine dirs.
Fixes #10218
Test: unit(dev)
DTest: materialized_views_test.py::TestInterruptBuildProcess
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220316152812.3344634-1-bhalevy@scylladb.com>
Depending on the bound weight of the position of the last fragment we
expect to read. Currently the range is unconditionally exclusive, which
might lead to an artificial difference between the read and expected
data, due to a fragment being possibly omitted.
Fixes #10229.
Tests: unit(boost/flat_mutation_reader_test:test_flat_mutation_reader_consume_single_partition)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220304133515.74586-1-bdenes@scylladb.com>
"
This patchset adds two new operations to scylla-sstable:
* validate-checksums - helps identify whether an sstable is intact or
not, by checking the digest and the per-chunk checksums against the
data on disk.
* decompress - helps when one wants to manually examine the content of a
compressed sstable.
Refs: #497
Tests: unit(dev)
"
* 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla:
tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
tools/scylla-sstable: add decompress operation
tools/scylla-sstables: add validate-checksums operation
sstables/sstable: add validate_checksums()
sstables/sstable: add raw_stream option to data_stream()
sstables/sstable: make data_stream() and data_read() public
utils/exceptions: add maybe_rethrow_exception()
"
This mini-series contains a few trivial fixes to be able
to build scylla on Fedora 36 Pre-Release, which will
soon enter "Beta" state.
It's mostly fixes due to some changes to external dependencies,
e.g. boost.outcome and libfmt.
Tests: unit(dev)
"
* 'fc36_build_fixes_v1' of https://github.com/ManManson/scylla:
schema: fix build issues with libstdc++ 12
treewide: fix compilation issues with fmtlib 8.1.0+
utils/result.hh: add missing header includes for boost.outcome
The update_table() helper template too. And the update_peer_info as
well. It can stop using global qctx and cache after that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's called from two places -- .setup() and schema_tables code. Both
have the instance hanging around, so the method can be de-marked
static and set free from global qctx
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All its (indirect) callers had been patched to have it, now it's
possible to have the argument in it. Next patch will make use of it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method needs to call merge_schema() that will need system keyspace
instance at hand. The parse_s._t. method is boot-time one, pushing the
main-local instance through it is fine
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main target here is system_keyspace::update_schema_version() which
is now static, but needs to have system_keyspace at "this". Migration
manager is one of the places that calls that method indirectly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This one is a bit trickier than its four predecessors. The qctx's
qp().execute_cql() is replaced with qp().execute_internal() for
symmetry with the rest. Without data args it's the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Yet another same step. Drop static keyword and patch out globals.
Get config.cluster_name from _db while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just remove static mark and stop using global qctx.
Grab config from _db instead of argument while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before patching system_keyspace methods to use query processor from
its instance, the respective call is needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's called only on start and actively uses both qctx and local
cache. Next patches will fix the whole setup code to stop using
global qctx/cache.
For now setup invocation is left in its place, but it must really
happen in start() method. More patching is needed to make it work.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For now it's a reference, but all users of the cache will be
eventually switched into using system_keyspace.
In cql-test-env cache starting happens earlier than it was
before, but that's OK, it just initializes empty instances.
In main cache starts at the same time as before patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Start happens at exactly the same place. One thing to take care
of is that it happens on all shards.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The db::system_keyspace was made a class some time ago, time to create
a standard sharded<> object out of it. It needs query processor and
database. None of those dependencies is started early enough, so the
object for now starts in two steps -- early instances creation and
late start.
The instances will carry qctx and local_cache on board and all the
services that need those two will depend on system-keyspace. Its start
happens at exactly the same place where system_keyspace::setup happens
thus any service that will use system_keyspace will be on the same
safe side as it is now.
In the further future the system_keyspace will be equipped with its
own query processor backed by local replica database instance, instead
of the whole storage proxy as it is now.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Switch from using `std::map::insert` to `std::map::emplace`
in the `get_sharder()` function, since we are constructing
a temporary value anyway.
Also, use `std::make_pair` instead of initializer list because
for some reason Clang 13 w/ libstdc++ 12 argues about not
being able to find a suitable overload.
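A minimal sketch of the pattern described above follows. The `sharder_stub` type and the `get_or_create_sharder()` helper are hypothetical stand-ins, not the real `get_sharder()` code; they only show `std::map::emplace` combined with `std::make_pair`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the cached value type.
struct sharder_stub {
    unsigned shard_count;
};

// Returns the cached entry for `name`, creating it on first use.
// emplace() leaves an existing entry untouched, and std::make_pair spells
// out the pair type instead of relying on a braced initializer list.
inline const sharder_stub& get_or_create_sharder(
        std::map<std::string, sharder_stub>& cache,
        const std::string& name,
        unsigned shard_count) {
    return cache.emplace(std::make_pair(name, sharder_stub{shard_count})).first->second;
}
```

A second call with the same key returns the entry created by the first call, because `emplace` does not overwrite existing elements.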
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.
A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.
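As an illustration of the explicit conversion, the sketch below wraps the `static_cast` in a small helper that behaves like `std::to_underlying` on pre-C++23 toolchains. The enum `stream_state` and the helper names are hypothetical, and `std::to_string` stands in for the fmtlib call so the example needs only the standard library.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical scoped enum used only for this illustration.
enum class stream_state : int { idle = 0, active = 3 };

// Pre-C++23 substitute for std::to_underlying: an explicit static_cast
// to the enum's underlying integer type.
template <typename Enum>
constexpr std::underlying_type_t<Enum> to_int(Enum e) noexcept {
    static_assert(std::is_enum_v<Enum>, "to_int() expects an enumeration type");
    return static_cast<std::underlying_type_t<Enum>>(e);
}

// The enum is converted explicitly before formatting, because a scoped
// enum does not convert to an integer implicitly.
inline std::string describe(stream_state s) {
    return "state=" + std::to_string(to_int(s));
}
```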
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Looks like internal boost.outcome headers don't include some
of needed dependencies, so do that manually in our headers.
For some reason it worked before, but started to fail when
building on Fedora 36 setup.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The atomicity was lost in commit a2a5e530f0.
Registration of compacting SSTables now happens in rewrite_sstables_compaction_task
ctor, but that's risky because a regular compaction could pick those
same files if run_with_compaction_disabled() defers after the callback
passed to it returns, and before run__w__c__d() caller has a chance to
run. The deferring point is very much possible, because submit()
(submits a regular job) is called when run__w__c__d() reenables compaction
internally.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220315182857.121479-1-raphaelsc@scylladb.com>
Compaction manager is calling back the table to run off-strategy compaction,
but the logic clearly belongs to manager which should perform the
operation independently and only call table to update its state with the
result.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220315174504.107926-2-raphaelsc@scylladb.com>
is_supported_by checks whether a given restriction
can be supported by some index.
Currently when a subscripted value, e.g `m[1]` is encountered,
we ignore the fact that there is a subscript and ask
whether an index can support the `m` itself.
This looks like unintentional behaviour leftover
from the times when column_value had a sub field,
which could be easily forgotten about.
Scylla doesn't support indexes on collection elements at all,
so simply returning false there seems like a good idea.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #10227
Sstables have two kinds of checksums: per-chunk checksums and
full-checksum (digest) calculated over the entire content of Data.db.
The full-checksum (digest) is stored in Digest.crc
(component_type::Digest).
When compression is used, the per-chunk checksum is stored directly
inside Data.db, after each compressed chunk. These are validated on
read, when decompressing the respective chunks.
When no compression is used, the per-chunk checksum is stored separately
in CRC.db (component_type::CRC). Chunk size is defined and stored in said
component as well.
In both compressed and uncompressed sstables, checksums are calculated
on the data that is actually written to disk, so in case of compressed
data, on the compressed data.
This method validates both the full checksum and the per-chunk checksum
for the entire Data.db.
Permits have to wait for re-admission after having been evicted. This
happens via `reader_permit::maybe_wait_readmission()`. The user of this
method -- the evictable reader -- uses it to re-wait admission when the
underlying reader was evicted. There is one tricky scenario however,
when the underlying reader is created for the first time. When the
evictable reader is part of a multishard query stack, the created reader
might in fact be a resumed, saved one. These readers are kept in an
inactive state until actually resumed. The evictable reader shares its
permit with the to-be-resumed reader so it can check whether it has been
evicted while saved and needs to wait readmission before being resumed.
In this flow it is critical that there is no preemption point between
this check and actually resuming the reader, because if there is, the
reader might end up actually recreated, without having waited for
readmission first.
To help avoid this situation, the existing `maybe_wait_readmission()` is
split into two methods:
* `bool reader_permit::needs_readmission()`
* `future<> reader_permit::wait_for_readmission()`
The evictable reader can now ensure there is no preemption point between
`needs_readmission()` and resuming the reader.
Fixes: #10187
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>
"
Namely the query result writer and the reconcilable result builder, used
for building results for regular queries and mutation queries (used in
read repair) respectively.
With this, there are no users left for the v1 output of the compactor,
so we remove that, making the compactor v2 all-the-way (and simpler).
This means that for regular queries, a downgrade phase is eliminated
completely, as regular queries don't store range tombstone in their
result, so no need to convert them.
Tests: unit(dev, release, debug)
"
* 'result-builders-v2/v1' of https://github.com/denesb/scylla:
reconcilable_result_builder: remove v1 support
query_result_builder: remove v1 support
mutation_compactor: drop v1 related code-paths
mutation_compactor: drop v1 support altogether from the API
tree: migrate to the v2 consumer APIs
test/boost/mutation_test: remove v1 specific test code
querier: switch to v2 compactor output
reconcilable_result_builder: add v2 support
query_result_writer: add v2 support
query_result_builder: make consume(range_tombstone) noop
Flushing the base table triggers view building
and corresponding compactions on the view tables.
Temporarily disable compaction on both the base
table and all its views before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.
This should make truncate go faster.
Refs #6309
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
truncate-views is called only internally from database::truncate.
Next step will be to disable compactions on the base
table and view before flush and snapshot.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The `result_try` and `result_futurize_try` are supposed to handle both
failed results and exceptions in a way similar to a try..catch block.
In order to catch exceptions, the metaprogramming machinery invokes the
fallible code inside a stack of try..catch blocks, each one of them
handling one exception. This is done instead of creating a single
try..catch block, as to my knowledge it is not possible to create
a try..catch block with the number of "catch" clauses depending on a
variadic template parameter pack.
Unfortunately, a "try" with multiple "catches" is not functionally
equivalent to a "try block stack". Consider the following code:
    try {
        try {
            return execute_try_block();
        } catch (const derived_exception&) {
            // 1
        }
    } catch (const base_exception&) {
        // 2
    }
If `execute_try_block` throws `derived_exception` and the (1) catch
handler rethrows this exception, it will also be handled in (2), which
is not the same behavior as if the try..catch stack was "flat".
This causes wrong behavior in `result_try` and `result_futurize_try`.
The following snippet has the same, wrong behavior as the previous one:
    return utils::result_try([&] {
        return execute_try_block();
    }, utils::result_catch<derived_exception>([&] (const auto&& ex) {
        // 1
    }), utils::result_catch<base_exception>([&] (const auto&& ex) {
        // 2
    }));
This commit fixes the problem by adding a boolean flag which is set just
before a catch handler is executed. If another catch handler is
accidentally matched due to exception rethrow, the catch handler is
skipped and exception is automatically rethrown.
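The difference between the two shapes, and the effect of the flag, can be reproduced with plain try..catch blocks. The sketch below is a hand-written illustration, not the `result_try` metaprogramming itself; the exception types and function names are hypothetical, and each function returns a trace of which handlers ran.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

struct base_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct derived_exception : base_exception {
    using base_exception::base_exception;
};

// Naive try-block stack: a rethrow from handler (1) is caught again by
// handler (2), which a flat try..catch would never do.
inline std::string naive_stack() {
    std::string trace;
    try {
        try {
            throw derived_exception("boom");
        } catch (const derived_exception&) {
            trace += "1";
            throw; // rethrow from the inner handler
        }
    } catch (const base_exception&) {
        trace += "2"; // unintended second match
    }
    return trace;
}

// Guarded stack: a flag is set just before a handler body runs. Outer
// handlers that match afterwards skip their body and rethrow.
inline std::string guarded_stack() {
    std::string trace;
    bool handler_ran = false;
    try {
        try {
            try {
                throw derived_exception("boom");
            } catch (const derived_exception&) {
                handler_ran = true;
                trace += "1";
                throw;
            }
        } catch (const base_exception&) {
            if (handler_ran) {
                throw; // already handled once: propagate unchanged
            }
            handler_ran = true;
            trace += "2";
        }
    } catch (const base_exception&) {
        trace += "!"; // the rethrown exception reached the caller
    }
    return trace;
}
```

In this sketch `naive_stack()` runs both handlers, while `guarded_stack()` runs only handler (1) and lets the rethrown exception escape to the caller.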
Tests: unit(dev, debug)
Fixes: #10211
Closes #10216
Returns a RAII class compaction_reenabler
that conditionally reenables compaction
for the given table when destroyed.
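The RAII idea can be sketched as follows. `table_stub` and `compaction_reenabler_sketch` are hypothetical names for this illustration; the real `compaction_reenabler` works with the compaction manager's per-table state, which is not modeled here.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for per-table compaction state: compaction is
// considered enabled while the counter is zero.
struct table_stub {
    int compaction_disabled = 0;
};

inline bool compaction_enabled(const table_stub& t) {
    return t.compaction_disabled == 0;
}

// Disables compaction on construction and reenables it on destruction.
// A moved-from guard holds no table, so it reenables nothing.
class compaction_reenabler_sketch {
    table_stub* _table;
public:
    explicit compaction_reenabler_sketch(table_stub& t) noexcept : _table(&t) {
        ++_table->compaction_disabled;
    }
    compaction_reenabler_sketch(compaction_reenabler_sketch&& o) noexcept
        : _table(std::exchange(o._table, nullptr)) {}
    compaction_reenabler_sketch(const compaction_reenabler_sketch&) = delete;
    compaction_reenabler_sketch& operator=(const compaction_reenabler_sketch&) = delete;
    compaction_reenabler_sketch& operator=(compaction_reenabler_sketch&&) = delete;
    ~compaction_reenabler_sketch() {
        if (_table) {
            --_table->compaction_disabled;
        }
    }
};
```

Holding the guard in a local variable keeps compaction disabled for the enclosing scope, including when the scope is left through an exception.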
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In commit afab1a97c6, we added
test_tools.py - tests for the various tools embedded in the Scylla
executable. These tests need to know where the Scylla executable is,
and also where its sstables are stored. For this, the commit added two
test parameters - "--scylla-path" and "--workdir" - with which the
"run" script communicated this knowledge to the test.
However, that implementation meant that these tests only work if the
test was run via the test/cql-pytest/run script - they won't work if
the user ran Scylla/pytest manually, or through some other script not
passing these options.
This patch drops the "--scylla-path" and "--workdir" parameters, and
instead the test figures out this information on its own:
1. To find the Scylla executable, we begin by looking (using the
local_process_id(cql) function from the previous patch) for a
local process which listens to our CQL connection, and then find
the executable's path using /proc.
2. To find the Scylla data directory (which is what we really need, not
workdir which is just a shortcut to set all directories!), we
retrieve this configuration from the system.config table through CQL.
I tested that test_tools.py now works not only through test/cql-pytest/run
but also if I run Scylla manually and then run "pytest test_tools.py"
without any extra parameters.
Fixes #10209
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220314151125.2737815-2-nyh@scylladb.com>
Generally, cql-pytest tests do not, and *should not* rely on looking up
messages in the Scylla log file: Relying on such messages makes it
impossible to run the same test against Cassandra or even a remotely-
installed Scylla, and the tests tend to break when logging (which is not
considered part of our API) changes. Moreover, usually what our dtests
achieve by looking at the log - e.g., figuring out when some event has
happened - can be achieved through official CQL APIs, and this is what
normal users do anyway (users don't normally dig through the log to
figure out when their operation completed).
However, sometimes we do want to write a test to confirm that during a
certain operation, a certain log message gets written to Scylla's log.
A desire to do this was raised by @fruch and @soyacz, so in this patch
I provide a mechanism to do this, and a trivial example - which checks
that a "Creating ..." message appears on the log whenever a table is
created, and "Dropping ..." when the table is deleted.
As is explained in detail in comments in the patch, Scylla's log file
is found automatically, without relying on Scylla's runner (such as
the script test/cql-pytest/run) communicating to the test where the log
file is. If the log file can't be found - e.g., we're testing a remote
Scylla, or if this isn't Scylla, the tests are skipped.
I would like all logfile-testing tests to be in the same file,
test_logs.py. As I explained above, I think it is a mistake for general
tests to check the log file just because they can. I think that the only
tests that should use the log file are tests deliberately written to
check what gets logged - and those can be collected in the same file.
As part of this patch, we add the utility function local_process_id(cql)
to find (if we can) the local process which listens to the connection
"cql". This utility function will later be useful in more places - for
example test_tools.py needs to find Scylla's executable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220314151125.2737815-1-nyh@scylladb.com>
To make major compaction more resilient to low-
disk space conditions, 342bfbd65a
sorted the tables based on their live disk space used.
However, each shard still makes progress at its own pace.
This change serializes major compaction between tables
so we still compact in parallel on all shards, but one
(distributed) table at a time.
As a follow-up, we can consider serializing even at the single shard
level when disk space is critically low, so we can't even risk
parallel compaction across all shards.
Refs scylladb/scylla-dtest#2653
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220313153814.2203660-1-bhalevy@scylladb.com>
With compact_sstables() now living in compaction_manager::task,
release_exhausted no longer has to live inside compaction_descriptor,
which is a good direction because implementation detail is being
removed from the interface.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311023410.250149-2-raphaelsc@scylladb.com>
Table submits compaction request into manager, which in turn calls
back table to run the compaction when the time has come, i.e.:
table -> compaction manager -> table -> execute compaction
But manager should not rely on table to run compaction, as compaction
execution procedure sits one layer below the manager and should be
accessed directly by it, i.e:
table -> compaction manager -> execute compaction
This makes code easier to understand and update_compaction_history()
can now be noop for unit tests using table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311023410.250149-1-raphaelsc@scylladb.com>
"
This series hardens raft_group_registry::stop_servers
and uses it to drain_on_shutdown, called before
the database is stopped in cql_test_env.
(Not needed for main).
raft_group_registry deferred_stop is introduced right after
the service is started to make sure it's properly stopped
even if there's an exception at any point while starting.
Test: unit(dev)
"
* tag 'raft_group_registry-drain_on_shutdown-v1' of https://github.com/bhalevy/scylla:
cql_test_env: raft_group_registry::drain_on_shutdown before stopping the database
raft_group_registry: harden stop_servers
raft_group_registry: delete unused _shutdown_gate
When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader.
Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly.
This patch series replaces hot path uses of empty_flat_reader with an empty optional.
Performance effects:
`perf_simple_query --smp 1`
TPS: 138k -> 168k
allocs/op: 80.2 -> 71.1
insns/op: 49.9k -> 45.1k
`perf_simple_query --smp 1 --enable-cache=1 --flush`
TPS: 125k -> 150k
allocs/op: 79.2 -> 71.1
insns/op: 51.7k -> 47.2k
For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread.
Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache and memtables and needs to be merged) are unaffected.
Closes #10204
* github.com:scylladb/scylla:
replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader()
row_cache: Add row_cache::make_reader_opt()
replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader()
memtable: Add memtable::make_flat_reader_opt()
[avi: adjust #include for readers/ split]
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many files in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes #10194
When there is nothing to read, make_flat_reader() returns an empty (no-op)
reader. But it turns out that constructing, combining and destroying that
empty reader is quite costly.
As an optimization, add an alternative version which returns an empty optional
instead.
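The shape of the optimization can be sketched with `std::optional`. Everything below is a simplified, hypothetical model (`reader_stub`, `make_reader_opt()` and `collect_readers()` over integer ranges), not the memtable or row cache API.

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a reader over the rows selected by a query.
struct reader_stub {
    std::vector<int> rows;
};

// Returns no reader at all when nothing in `data` falls in [lo, hi),
// instead of constructing a reader that would produce nothing.
inline std::optional<reader_stub> make_reader_opt(const std::vector<int>& data, int lo, int hi) {
    std::vector<int> rows;
    for (int v : data) {
        if (v >= lo && v < hi) {
            rows.push_back(v);
        }
    }
    if (rows.empty()) {
        return std::nullopt;
    }
    return reader_stub{std::move(rows)};
}

// The caller only keeps readers that exist, so sources without matching
// data add nothing to the set of readers that must be merged.
inline std::vector<reader_stub> collect_readers(
        const std::vector<std::vector<int>>& sources, int lo, int hi) {
    std::vector<reader_stub> readers;
    for (const auto& src : sources) {
        if (auto r = make_reader_opt(src, lo, hi)) {
            readers.push_back(std::move(*r));
        }
    }
    return readers;
}
```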
database
We're currently stopping raft_gr before
shutting the database down, but we fail to do that if
anything goes wrong before that, e.g. if
distributed_loader::init_non_system_keyspaces fails.
This change splits drain_on_shutdown out of stop()
to stop the raft groups before the database is stopped
and does the rest in a deferred_stop placed right
after the raft_gr registry is started.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
stop_servers should never fail since it's called on
the shutdown path.
Use a local gate in stop_servers() to wait on all
background raft group server aborts.
Also, handle theoretical exceptions from server::abort()
to guarantee success.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Recent PR #10092 (propagating read timeouts on coordinator without
throwing) accidentally removed a line which cancelled
`abstract_read_resolver`'s `_timeout` timer after a read failure.
Because of that, it might happen that after a read failure the timer is
triggered and the `_done_promise` is set twice which triggers an assert
in seastar.
This commit brings back the line which cancels the timeout timer.
Fixes: #10193
Closes #10206
This is a translation of Cassandra's CQL unit test source file
validation/operations/BatchTest.java into our cql-pytest framework.
This test file includes 13 tests for various types of BATCH operations.
All tests pass on Scylla - no known or new bugs were reproduced.
Two of the tests involve very slow testing of TTLs, so after verifying
they work I marked them "skip" for now (we can always turn them on later,
perhaps after reducing the length or number of the sleeps).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220313121634.2611423-1-nyh@scylladb.com>
Commit 1c99ed6ced added tracing logs
about the index chosen for the query, but aggregate queries have
a separate code path, which wasn't taken into account.
After this patch, tracing for aggregate queries also includes
this additional information.
Closes #10195
interrupt() makes it sound like it's interrupting the compaction, but it's
actually called *on* interrupt, to handle the interrupt scenario.
Let's rename it to on_interrupt().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311000128.189840-1-raphaelsc@scylladb.com>
The change is mostly mechanical: update all compactor instances to the
_v2 variant and update all call-sites, of which there are not that many.
As a consequence of this patch, queries -- both single-partition and
range-scans -- now do the v2->v1 conversion in the consumers, instead of
in the compactor.
Add a `consume()` overload for range tombstone changes and convert them
internally to range tombstones, as the underlying reconcilable result
is still v1.
Add a consume() overload which takes a range tombstone change and drops
it just like the existing range tombstone overload does: query results
don't care about range tombstones.
The downstream consumer (mutation_querier) already ignores range
tombstones, so no point forwarding them to it. This makes adding v2
support easier too as range tombstone changes can be similarly dropped.
Changing the capture list of a lambda in
forward_service::execute_on_this_shard from [&] to an explicit one
improves readability and prevents potential bugs.
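A small, generic illustration of the benefit follows; it is unrelated to the actual lambda in `forward_service::execute_on_this_shard`, and `make_adder()` is a hypothetical function.

```cpp
#include <cassert>
#include <functional>

// The explicit capture list documents that `increment` is copied into the
// lambda. With [&] the lambda would hold a reference to a parameter that
// no longer exists once make_adder() returns.
inline std::function<int(int)> make_adder(int increment) {
    return [increment] (int x) { return x + increment; };
}
```

With an explicit list, a reviewer can see from the capture clause alone what the lambda depends on and whether each item is held by value or by reference.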
Closes #10191
The services' configuration should be performed with the help of
service-specific config that's filled by the service creator. This
is not the case for gossiper that grabs the db::config and keeps
reference on it throughout its lifetime.
This set brings the gossiper configuration to the described form
by putting the needed config bits onto gossip_config (that already
exists and is partially used for gossiper configuration). And two
live-updateable options need extra care.
tests: unit(dev), dtest.simple_boot_shutdown(dev)
* 'br-gossiper-no-db-config' of https://github.com/xemul/scylla:
gossiper: Remove db::config reference from gossiper
gossiper: Keep live-updateable options on gossiper
gossiper: Keep immutable options on gossip_config
Although its API was long converted to v2, its implementation stayed v1
because the memtable and mutation API were still v1. Now that the
memtable flush returns a v2 reader we can have a second look at
converting this. While the mutation API still uses v1, this can easily
be worked around by going through `mutation_rebuilder_v2`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220302145945.189607-1-bdenes@scylladb.com>
The series overhauls the compaction_manager::task design and implementation
by properly layering the functionality between the compaction_manager
that deals with generic task execution, and the per-task business logic that is defined
in a set of classes derived from the generic task class.
While at it, the series introduces `task::state` and a set of helper functions to manage it
to prevent leaks in the statistics, fixing #9974.
Two more stats counters were exposed: `completed_tasks` and a new `postponed_tasks`.
Test: sstable_compaction_test
Dtest: compaction_test.py compaction_additional_test.py
Fixes #9974
Closes #10122
* github.com:scylladb/scylla:
compaction_manager: use coroutine::switch_to
compaction_manager::task: drop _compaction_running
compaction_manager: move per-type logic to derived task
compaction_manager: task: add state enum
compaction_manager: task: add maybe_retry
compaction_manager: reevaluate_postponed_compactions: mark as noexcept
compaction_manager: define derived task types
compaction_manager: register_metrics: expose postponed_compactions
compaction_manager: register_metrics: expose failed_compactions
compaction_manager: register_metrics: expose _stats.completed_tasks
compaction: add documentation for compaction_type to string conversions
compaction: expose to_string(compaction_type)
compaction_manager: task: standardize task description in log messages
compaction_manager: refactor can_proceed
compaction_manager: pass compaction_manager& to task ctor
compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
Saving an allocation for running the functor
as a task in the switched-to scheduling group.
Also, switch to the desired scheduling group at
the beginning of the task so that the higher level logic,
like getting the list of sstables to compact
will be performed under the desired scheduling group,
not only the compaction code itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace the _compaction_running boolean member
by calculating _state == state::active
now that setup_new_compaction switches state to
`active`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the business logic into the task specific classes.
Separating initialization during task construction,
from the compaction_done task, moved into
a do_run() method, and in some cases moving
a lambda function that was called per table (as in
rewrite_sstables) into a private method of the
derived class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add an enum class representing the task state machine
and a switch_state function to transition between the states
and update the corresponding compaction_manager stats counters.
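A minimal sketch of the idea, assuming a much smaller state set and counter set than the real compaction_manager has. The `stats` fields and the `task_sketch` class are hypothetical names used only for illustration.

```c++
#include <cassert>
#include <cstdint>

// Hypothetical counters; the real compaction_manager stats differ.
struct stats {
    std::int64_t pending_tasks = 0;
    std::int64_t active_tasks = 0;
    std::int64_t completed_tasks = 0;
    std::int64_t errors = 0;
};

class task_sketch {
public:
    enum class state { none, pending, active, done, failed };

    explicit task_sketch(stats& s) : _stats(s) {}

    // Move to new_state and keep the counters balanced: whatever was
    // added when entering the old state is subtracted when leaving it.
    state switch_state(state new_state) {
        const state old_state = _state;
        switch (old_state) {
        case state::pending:
            assert(_stats.pending_tasks > 0);
            --_stats.pending_tasks;
            break;
        case state::active:
            assert(_stats.active_tasks > 0);
            --_stats.active_tasks;
            break;
        default:
            break;
        }
        switch (new_state) {
        case state::pending: ++_stats.pending_tasks; break;
        case state::active: ++_stats.active_tasks; break;
        case state::done: ++_stats.completed_tasks; break;
        case state::failed: ++_stats.errors; break;
        default: break;
        }
        _state = new_state;
        return old_state;
    }

    state current() const { return _state; }

private:
    stats& _stats;
    state _state = state::none;
};
```

Routing every transition through one function is what keeps the gauges from leaking: there is no code path that enters a state without a matching exit.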
Refs #9974
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reading data from sstables without compacting first puts
unnecessary pressure on the cache. The mutation streams
need to be resolved anyway before passing to subsequent
consumers, so it's better to do it as close to the
source as possible.
Fixes: #3568
Closes #10188
"
This patch-set converts the sstable writer to v2, then prepares the
ground for users actually being able to use the v2 variant. Finally it
converts all users to do so and then decommissions the v1 variant.
For users to be able to use the v2 writer API, we first have to add a v2
output to the compactor, as some users write to sstables via the
compactor.
Tests: unit(dev, release)
"
* 'sstable-writer-v2/v2' of https://github.com/denesb/scylla:
sstables/sstable: remove now unused v1 write_components() variant
mutation_compactor: remove now unused compact_for_compaction
test/boost/mutation_test: migrate to compact_for_mutation_v2
streaming: migrate to v2 variant of sstable writer API
memtable-sstable: migrate to v2 variant of sstable writer API
test: migrate to the v2 variant of the sstable writer API
sstables/sstable: expose v2 variant of write_components()
sstables: convert mx writer to v2
sstables/metadata_collector: use position_in_partition for min/max keys
test/boost/mutation_test: test_compactor_range_tombstone_spanning_many_pages extend to check v2 output too
mutation_reader: convert compacting reader v2
mutation_compactor: add v2 output
mutation_compactor: make _last_clustering_pos track last input
range_tombstone_change: add set_tombstone()
test/lib/mutation_source_test: log name of each run_mutation_source()
To be used in the next patch to generate
a string description from the compaction_type.
In theory, we could use compaction_name()
but the latter returns the compaction type
in all-upper case and that is very different from
what we print to the log today. The all-upper
strings are used for the api layer, e.g. to
stop tasks of a particular compaction type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define task::describe and use it via operator<<
to print the task metadata to the log in a standard way.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the task-internal parts of can_proceed
to a respective compaction_manager::task method,
preparing for turning it into a class with
a proper hierarchy of access to private members.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it to get the compaction state of the
table to compact.
It will be used in a later patch
to manage the task state from task
methods.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Like all other maintenance operations, acquire the _maintenance_ops_sem
once for the whole task, rather than for each sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Maintenance operations like cleanup, upgrade, reshape, and reshard
are serialized with major compaction using the _maintenance_ops_sem
and they need no further synchronization with regular compaction
by acquiring the per-table read lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);
(1) directly exposes the writer type, so we have to update all users of
it (there are not that many) in this same patch. We defer updating
users of (2) to follow-up commits.
Instead of naked clustering keys. Working with the latter is dangerous
because it cannot accurately represent the entire clustering domain: it
cannot represent positions between (before/after) keys. For this reason
the metadata collector had a separate update_min_max_components()
overload for range tombstones because the positions of these cannot be
represented by clustering keys alone.
Moving to position_in_partition solves this problem and it is now enough
to have a single overload with position_in_partition_view. This is also
more future proof as it will work with range tombstone changes without
any additional changes.
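To illustrate why a position type can express what a bare key cannot, here is a small sketch under simplifying assumptions: clustering keys are plain integers, and a position is a key plus a weight that places it before, at, or after that key. The `position` and `min_max_tracker` names are hypothetical and only loosely modeled on position_in_partition.

```c++
#include <cassert>
#include <optional>
#include <tuple>

// Weight placing a position relative to its key (simplified model).
enum class bound_weight : int { before_all_prefixed = -1, equal = 0, after_all_prefixed = 1 };

struct position {
    int key;
    bound_weight weight;

    static position before_key(int k) { return {k, bound_weight::before_all_prefixed}; }
    static position for_key(int k) { return {k, bound_weight::equal}; }
    static position after_key(int k) { return {k, bound_weight::after_all_prefixed}; }
};

inline bool operator<(const position& a, const position& b) {
    return std::make_tuple(a.key, static_cast<int>(a.weight))
         < std::make_tuple(b.key, static_cast<int>(b.weight));
}

// One update() overload serves rows and range tombstone bounds alike.
class min_max_tracker {
    std::optional<position> _min;
    std::optional<position> _max;
public:
    void update(const position& p) {
        if (!_min || p < *_min) { _min = p; }
        if (!_max || *_max < p) { _max = p; }
    }
    const position& min() const { assert(_min); return *_min; }
    const position& max() const { assert(_max); return *_max; }
};
```

A range tombstone ending just after key 7 and a row at key 7 are distinct positions here, which a bare key could not express.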
The output version is selected via compactor_output_format, which is a
template parameter of `compact_mutation_state` and all downstream types.
This is to ensure a compaction state created to emit a v2 stream will
not be accidentally used with a v1 consumer.
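The compile-time guarantee can be sketched as follows. Apart from the `compactor_output_format` name, everything here is a hypothetical stand-in: the consumer just records strings, and the state class is a toy replacement for `compact_mutation_state`.

```c++
#include <cassert>
#include <string>
#include <vector>

enum class compactor_output_format { v1, v2 };

// Hypothetical consumer, tagged with the stream format it understands.
template <compactor_output_format Format>
struct recording_consumer {
    std::vector<std::string> fragments;
};

// Hypothetical stand-in for the compaction state.
template <compactor_output_format Format>
class compaction_state_sketch {
    int _last_pos = 0;
public:
    // Accepts only a consumer with the same Format. Passing a consumer of
    // the other format does not compile, because the two instantiations
    // are unrelated types.
    void emit_tombstone(recording_consumer<Format>& consumer, int pos) {
        assert(pos >= _last_pos);  // fragments are expected in position order
        _last_pos = pos;
        if constexpr (Format == compactor_output_format::v1) {
            consumer.fragments.push_back("range_tombstone@" + std::to_string(pos));
        } else {
            consumer.fragments.push_back("range_tombstone_change@" + std::to_string(pos));
        }
    }
};
```

Because the format is part of the type, the mismatch is rejected by the compiler rather than discovered at run time.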
When using a v2 output, the current active tombstone has to be tracked
separately for the regular and for the gc consumer (if any), so that
each can be closed properly on EOS. The current effective tombstone is
tracked separately from these two. The reason is that purged tombstones
are still applied to data, but are not emitted to the regular consumer.
Instead of updating _last_clustering_pos whenever a clustering fragment
is pushed to the consumers, we now update it whenever a clustering
fragment enters the compactor. Not only is this much more robust, but it
also makes more sense. Just because a range tombstone is purged (and
therefore the consumer doesn't see it), it still moves the logical
clustering position in the stream. Also, tracking the input side avoids
any ambiguity related to cases where we have two consumers (regular + gc
consumer).
Although we have a log in run_mutation_reader_tests(), it is useful to
know where it was called from, when trying to find the test scenario
that failed.
"
Add per-action help content for each action. Main description now points
to these for more details.
"
* 'scylla-types-improvements/v1' of https://github.com/denesb/scylla:
tools/types: update main description
tools/scylla-types: per-action help content
tools/scylla-types: description: remove -- from action listing
tools/scylla-types: use fmt::print() instead of std::cout <<
"
Also convert the foreign_reader used by it in the process.
Tests: unit(dev)
"
* 'multishard-writer-v2/v1' of https://github.com/denesb/scylla:
mutation_writer/multishard_writer: remove now unused v1 factory overloads
test/boost/mutation_writer_test: test the v2 variant of distribute_reader_and_consume_on_shards()
flat_mutation_reader: add v2 variant of make_generating_reader()
mutation_reader: multishard_writer: migrate implementation to v2
mutation_reader: convert foreign_reader to v2
streaming/consumer: convert to v2
mutation_writer/multishard_writer: add v2 variant of distribute_reader_and_consume_on_shards()
It was a suggestion from @psarna, done to get more info about the abort from #10174.
Closes #10185
* github.com:scylladb/scylla:
query: do not assert in `operator<<(ostream&, const forward_result::printer&)`
query: transform asserts into on_internal_error in forward_result::merge
There are two issues with current implementation of remove/remove_if:
1) If it happens concurrently with get_ptr(), the latter may still
populate the cache using value obtained from before remove() was
called. remove() is used to invalidate caches, e.g. the prepared
statements cache, and the expected semantic is that values
calculated from before remove() should not be present in the cache
after invalidation.
2) As long as there is any active pointer to the cached value
(obtained by get_ptr()), the old value from before remove() will be
still accessible and returned by get_ptr(). This can make remove()
have no effect indefinitely if there is persistent use of the cache.
One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:
User Defined Type value contained too many fields (expected 5, got 6)
The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.
The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.
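The behaviour described above can be modeled with a small, single-threaded sketch. It assumes a plain map of shared entries in which an empty `value` means the load is still in flight. The class and method names are hypothetical, and the real cache is asynchronous.

```c++
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical, single-threaded model of the cache's entry table.
class loading_values_sketch {
public:
    struct entry {
        std::optional<int> value;  // nullopt while the load is still in flight
    };

    // Return the entry for `key`, creating a not-yet-loaded one if absent.
    std::shared_ptr<entry> get_or_create(const std::string& key) {
        auto& slot = _entries[key];
        if (!slot) {
            slot = std::make_shared<entry>();
        }
        return slot;
    }

    bool contains(const std::string& key) const { return _entries.count(key) != 0; }

    // Erase immediately, even if callers still hold pointers to the entry.
    void remove(const std::string& key) { _entries.erase(key); }

    // Entries that are still loading cannot be tested, so they are erased too.
    template <typename Pred>
    void remove_if(Pred pred) {
        for (auto it = _entries.begin(); it != _entries.end();) {
            assert(it->second);
            const entry& e = *it->second;
            if (!e.value || pred(*e.value)) {
                it = _entries.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    std::unordered_map<std::string, std::shared_ptr<entry>> _entries;
};
```

Holders of an old pointer keep their stale value, but any lookup after the removal gets a fresh entry, which is the property invalidation needs.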
Fixes #10117
Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
It was done to show more context in case of forward_result::merge
arguments size mismatch and also to prevent aborts caused by other
nodes sending malformed data.
This patch gets rid of annoying pytest configuration warnings when running
test/cql-pytest/run. These started to happen after commit
afab1a97c6, due to a pytest bug:
In that commit, we added new "--scylla-path" and "--workdir" parameters
to our pytest tests, and test/cql-pytest/run started passing them,
and test/cql-pytest/run sometest runs pytest as:
pytest --host something --workdir somedir --scylla-path somepath sometest
Pytest wants to find a configuration file (pytest.ini or tox.ini) in the
directory where the tests live, but its logic to find that directory is
buggy: It (_pytest/config/findpaths.py::determine_setup()) looks at
the command line for directory names, and looks for config files in
these directories or any of their parents. It ignores parameters
beginning with "-", but in our case the various arguments like
"--scylla-path" are each followed by another option, and this one is
not ignored! So instead of looking for the config file in sometest's
parent directories (and finding test/cql-pytest/pytest.ini), pytest
sees the directory given after "scylla-path", and finds the completely
irrelevant tox.ini there - and uses that, which (depending what you
have installed) can generate warnings.
The solution is to change the run script to use "--scylla-path=..."
as one parameter instead of "--scylla-path ..." as two parameters.
When it's just one parameter, the pytest determine_setup() logic skips
it entirely, and finds just the actual test directory.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220309132726.2311721-1-nyh@scylladb.com>
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
Fixes #10183
Closes #10181
* github.com:scylladb/scylla:
test: add a case for LIKE operator on a descending order column
types: fix is_string for reversed types
This case is a regression test for issue #10181, where it turned out
that a clustering column with descending order is not properly
recognized as a string.
This test case used to fail with:
cassandra.InvalidRequest:
Error from server: code=2200 [Invalid query]
message="LIKE is allowed only on string types, which b is not"
...until it got fixed by the previous commit.
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
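A minimal sketch of the kind of check involved, assuming a toy type hierarchy in which a reversed type wraps an underlying type. The `type_sketch` struct is hypothetical and is not ScyllaDB's abstract_type.

```c++
#include <cassert>
#include <memory>

// Hypothetical miniature type description.
struct type_sketch {
    enum class kind { int32, utf8, ascii, reversed };

    kind k;
    std::shared_ptr<const type_sketch> underlying;  // set only for kind::reversed

    // Strip the "reversed" wrapper (used for DESC clustering order).
    const type_sketch& without_reversed() const {
        if (k != kind::reversed) {
            return *this;
        }
        assert(underlying);
        return underlying->without_reversed();
    }

    // Ask the underlying type, so a reversed text column is still a string.
    bool is_string() const {
        const kind uk = without_reversed().k;
        return uk == kind::utf8 || uk == kind::ascii;
    }
};
```

Checking `k` directly would report a reversed text column as a non-string, which is the failure mode the message describes.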
Since regular compaction may run in parallel no lock
is required per-table.
We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.
This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables) seen in #10060, and
possibly #10166.
Fixes #10175
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10177
The problem was incompatibility with Cassandra, which accepts a bool
as a string in the `fromJson()` UDF. The remaining difference between Cassandra and
Scylla is that Scylla accepts whitespace around the word in the string, while
Cassandra doesn't. Both are case-insensitive.
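The accepted inputs can be sketched with a small helper. This is an illustration only: `parse_bool_string` and its `trim_whitespace` flag are hypothetical names, with the flag standing in for the whitespace difference between the two systems described above.

```c++
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: interpret a JSON string as a boolean, case-insensitively.
// With trim_whitespace == true, surrounding whitespace is ignored.
inline std::optional<bool> parse_bool_string(std::string_view s, bool trim_whitespace = true) {
    auto is_space = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    if (trim_whitespace) {
        while (!s.empty() && is_space(s.front())) { s.remove_prefix(1); }
        while (!s.empty() && is_space(s.back())) { s.remove_suffix(1); }
    }
    std::string lowered(s);
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lowered == "true") { return true; }
    if (lowered == "false") { return false; }
    return std::nullopt;  // not a boolean word
}

// Usage example.
inline void parse_bool_string_examples() {
    assert(parse_bool_string(" True ").value() == true);       // whitespace tolerated
    assert(!parse_bool_string(" true ", false).has_value());   // strict variant rejects it
    assert(parse_bool_string("FALSE", false).value() == false);
}
```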
Fixes: https://github.com/scylladb/scylla/issues/7915
Closes #10134
* github.com:scylladb/scylla:
CQL3/pytest: Updating test_json
CQL3: fromJson accepts string as bool
Cassandra generally does not allow empty strings as partition keys (note, by the way, that empty strings are allowed as clustering keys, as well as in individual components of a compound partition key).
However, Cassandra does allow empty strings in _regular_ columns - and those regular columns can be indexed by a secondary index, or become an empty partition-key column in a materialized view. As noted in issues #9375 and #9364 and verified in a few xfailing cql-pytest tests, Scylla didn't allow these cases - and this patch series fixes that.
Before the last patch in this series finally enables empty-string partition keys in materialized views, we first need to solve a couple of bugs in our code related to handling empty partition keys:
The first patch fixes issue #10178 - a bug in `key_view::tri_compare()` where comparing two empty keys returned a random result instead of "equal".
The second patch fixes issue #9352: our tokenizer has an inconsistency where for an empty string key, two variants of the same function return different results:
1. One variant `murmur3_partitioner::get_token(bytes_view key)` returned `minimum_token()` for the empty string.
2. Another variant `murmur3_partitioner::get_token(const schema& s, partition_key_view key)` did not have this special case, and called the normal hash-function calculation on the empty string (the resulting token is 0).
Variant 2 was an unintentional bug, because Cassandra always does what variant 1 does. So the "obvious" fix here would be to fix variant 2 to do what variant 1 does. Nevertheless, we decided to do the opposite: Change variant 1 to match variant 2. The reasoning is as follows:
The `minimum_token()` is `token{token::kind::before_all_keys, 0 }` - it's not a real token. Since we intend in this patch series to allow real data to exist with the empty key, we need this real data to have a real token. For example, this token needs to be located on the token ring (so the empty-key partition will have replicas) and also belong to one of the shards, and it's not clear that `minimum_token()` will be handled correctly in this context.
After changing the token of the empty string to 0, we note that some places in the code assume that `dht::decorated_key(dht::minimum_token(), partition_key::make_empty())` is a legal decorated key. However, as far as I can tell, none of these places actually assume that the partition-key part (the `make_empty()`) really matches the token - this decorated key is only used to start an iteration (ignoring this key itself) or to indicate a non-existent key (in modern code `std::optional` should be used for that).
While normally changing the token of a key is a big faux-pas, which can result in old data no longer being readable, in this case this change is safe because:
1. Scylla previously disallowed empty partition keys (in both base tables and views), so we cannot have had such a partition key saved in any sstable.
2. Cassandra does allow empty partition keys in _views_ and _secondary indexes_, but we do not support migrating sstables of those into Scylla - users are expected to only migrate the base table and then re-create the view or index. So however Cassandra writes those empty-key partitions, we don't care.
The third patch finally fixes the materialized views implementation to not drop view rows with an empty-string partition key (#9375). This means we basically revert commit ec8960df45 - which fixed #3262 by disallowing empty partition keys in views, whereas this patch fixes the same problem by handling the empty partition keys correctly.
The fix for the secondary index bug (#9364) comes "for free" because it is based on materialized views.
We already had xfailing test cases for empty strings in materialized views and indexes, and after this series they begin to pass so the "xfail" mark is removed. The series also adds additional test cases that validate additional corner cases discovered during the debugging.
Fixes #9352
Fixes #9364
Fixes #9375
Fixes #10178
Closes #10170
* github.com:scylladb/scylla:
compound_compat.hh: add missing methods of iterator
materialized views: allow empty strings in views and indexes
murmur3: fix inconsistent token for empty partition key
compound_compat.hh: fix bug iterating on empty singular key
While debugging legacy_compound_view, I noticed that it cannot be used
as a C++20 std::ranges::input_range because it is missing some trivial
methods. So let's fix this, and make the life of future developers a
little bit easier.
The two trivial methods we need to implement:
1. A postfix increment operator. We already had a prefix increment
operator, but the C++20 concept weakly_incrementable also needs postfix.
2. By mistake (this will be corrected in https://wg21.link/P2325R3),
weakly_incrementable also required the default_initializable concept, so
our iterator type also needs a default constructor.
We'll never actually use this silly constructor, and when this C++20
standard mistake is corrected, we can remove this constructor.
After this patch, a legacy_compound_view is accepted for the C++20
ranges::input_range concept.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although Cassandra generally does not allow empty strings as partition
keys (note they are allowed as clustering keys!), it *does* allow empty
strings in regular columns to be indexed by a secondary index, or to
become an empty partition-key column in a materialized view. As noted in
issues #9375 and #9364 and verified in a few xfailing cql-pytest tests,
Scylla didn't allow these cases - and this patch fixes that.
The patch mostly *removes* unnecessary code: In one place, code
prevented an sstable with an empty partition key from being written.
Another piece of removed code was a function is_partition_key_empty()
which the materialized-view code used to check whether the view's
row will end up with an empty partition key, which was supposedly
forbidden. But in fact, such rows should have been allowed, as they are
in Cassandra and as the secondary-index implementation requires, and
the entire function wasn't necessary.
Note that the removed function is_partition_key_empty() was *NOT* required
for the "IS NOT NULL" feature of materialized views - this continues to
work as expected after this patch, and we add another test to confirm it.
Being null and being an empty string are two different things.
This patch also removes a part of a unit test which enshrined the
wrong behavior.
After this patch we are left with one interesting difference from
Cassandra: Though Cassandra allows a user to create a view row with an
empty-string partition key, and this row is fully visible when
scanning the view, this row can *not* be queried individually because
"WHERE v=''" is forbidden when v is the partition key (of the view).
Scylla does not reproduce this anomaly - and such point query does work
in Scylla after this patch. We add a new test to check this case, and mark
it "cassandra_bug", i.e., it's a Cassandra behavior which we consider
wrong and don't want to emulate.
This patch relies on #9352 and #10178 having been fixed in previous patches,
otherwise the WHERE v='' does not work when reading from sstables.
We add to the already existing tests we had for empty materialized-views
keys a lookup with WHERE v='' which failed before fixing those two issues.
Fixes #9364
Fixes #9375
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Traditionally in Scylla and in Cassandra, an empty partition key is mapped
to minimum_token() instead of the empty key's usual hash function (0).
The reasons for this are unknown (to me), but one possibility is that
having one known key that maps to the minimal token is useful for
various iterations.
In murmur3_partitioner.cc we have two variants of the token calculation
function - the first is get_token(bytes_view) and the second is
get_token(schema, partition_key_view). The first includes that empty-
key special case, but the second was missing this special case!
As Kamil first noted in #9352, the second variant is used when looking
up partitions in the index file - so if a partition with an empty-string
key is saved under one token, it will be looked up under a different
token and not found. I reproduced exactly this problem when fixing
issues #9364 and #9375 (empty-string keys in materialized views and
indexes) - where a partition with an empty key was visible in a
full-table scan but couldn't be found by looking up its key because of
the wrong index lookup.
I also tried an alternative fix - changing both implementations to return
minimum_token (and not 0) for the empty key. But this is undesirable -
minimum_token is not supposed to be a valid token, so the tokenizer and
sharder may not return a valid replica or shard for it, so we shouldn't
store data under such token. We also have code (such as an increasing-
key sanity check in the flat mutation reader) which assumes that
no real key in the data can be minimum_token, and our plan is to start
allowing data with an empty key (at least for materialized views).
This patch does not risk a backward-incompatible disk format change
for two reasons:
1. In the current Scylla, there was no valid case where an empty partition
key may appear. CQL and Thrift forbid such keys, and materialized-views
and indexes also (incorrectly - see #9364, #9375) drop such rows.
2. Although Cassandra *does* allow empty partition keys, they are only
allowed in materialized views and indexes - and we don't support reading
materialized views generated by Cassandra (the user must re-generate
them in Scylla).
When #9364 and #9375 are fixed by the next patch, empty partition keys
will start appearing in Scylla (in materialized views and in the
materialized view backing a secondary index), and this fix will become
important.
Fixes #9352
Refs #9364
Refs #9375
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When iterating over a compound key with legacy_compound_view<>, when the
key is "singular" (i.e., a single column) we need to iterate over just the
component's actual bytes - without the two length bytes or end-of-component
byte. In particular, when the component is an *empty string*, the iteration
should return zero bytes. In other words, we should have begin() == end().
Unfortunately, this is not what happened - for an empty singular key, the
iterator returned for begin() was slightly different from end() - so
code using this iterator would not know there is nothing to iterate.
So in this patch we fix begin() and end() to return the same thing
if we have an empty singular key.
The bug in legacy_compound_view<> (which we fix here) caused a bug in
sstables::key_view::tri_compare(const schema& s, partition_key_view other),
causing it to return wrong results when comparing two empty keys. As a
result we were unable to retrieve a partition with an empty key from the
sstable index. So this patch is necessary to fix support for
empty-string keys in sstables (part of issue #9375).
This patch also includes a unit-test for this bug. We test it in the
context of sstables::key_view::tri_compare(), where it was first
discovered, and also test the legacy_compound_view itself. The included
test used to fail in both places before this patch, and pass after it.
Fixes #10178
Refs #9375
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
There's a _operation_mode enum sitting on storage_service that indicates the
top-level state of the scylla node. Next to it there's a bunch of booleans
that define (and duplicate) some sub-modes. These booleans just make the code
more obscure and complicated. This set removes all those booleans and patches
all the relevant checks/calls/methods to rely only on the operation mode.
Also, the switching between modes is simplified down to some bare minimum.
tests: unit(dev) dtest.simple_boot_shutdown(dev) manual(dev)
Manual test included start-stop, nodetool enablegossip, disablegossip and drain
commands, scylla-cly is_initialized and is_joined calls
As noticed in v2, this set changes the log messages that are checked by
dtests. The fix for dtest, that's compatible with both -- current scylla and
this patchset -- is already in dtest master.
"
* 'br-remove-bools-from-storage-service-3-rebase' of https://github.com/xemul/scylla:
storage_service: Relax operation modes switch
storage_service: Remove _ms_stopped
storage_service: Remove _is_bootstrap_mode
storage_service: Remove _initialized and is_initialized()
storage_service: Remove _joined and is_joined()
storage_service: Replace is_starting() with get_operation_mode()
storage_service: Make get_operation_mode() return mode itself
storage_service: Relax repeating set_mode-s
It seems that batch prepared statements always return false for
depends_on_keyspace and depends_on_column_family. This in turn
makes the cache removal criteria always evaluate to false,
which results in the queries not being evicted.
Here we change the functions to return the true state, meaning
they will return true if any of the sub-queries is dependent upon
the keyspace or column family.
In this fix we first make the API more coherent and then use this new API to implement
the batch statement's dependency test.
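A minimal sketch of the intended behaviour, assuming a single `depends_on()` method that takes a keyspace name and an optional table name. The class names and the exact signature are hypothetical and not taken from the cql3 code.

```c++
#include <algorithm>
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical statement interface.
struct statement_sketch {
    virtual ~statement_sketch() = default;
    // An empty cf_name means "any table in the keyspace".
    virtual bool depends_on(std::string_view ks_name,
                            std::optional<std::string_view> cf_name) const = 0;
};

struct single_statement_sketch : statement_sketch {
    std::string ks;
    std::string cf;
    single_statement_sketch(std::string k, std::string c) : ks(std::move(k)), cf(std::move(c)) {}
    bool depends_on(std::string_view ks_name,
                    std::optional<std::string_view> cf_name) const override {
        return ks == ks_name && (!cf_name || cf == *cf_name);
    }
};

struct batch_statement_sketch : statement_sketch {
    std::vector<std::shared_ptr<statement_sketch>> statements;
    // A batch depends on a keyspace or table if any of its sub-statements does.
    bool depends_on(std::string_view ks_name,
                    std::optional<std::string_view> cf_name) const override {
        return std::any_of(statements.begin(), statements.end(), [&](const auto& s) {
            assert(s);
            return s->depends_on(ks_name, cf_name);
        });
    }
};
```

With this in place, an invalidation predicate that asks each cached statement whether it depends on a dropped or altered table also matches batches.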
Fixes #10129
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #10132
* github.com:scylladb/scylla:
prepared_statements: Invalidate batch statement too
cql3 statements: Change dependency test API to express better it's purpose
The set_mode() tries to combine mode switching and extended logging,
but there are no places left that do need this flexibility. It's
simpler and nicer to make set_mode() _just_ switch the mode and
log some generic "entering ... mode" message.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This boolean protects do_stop_ms from re-entrancy. However, this
method is only called from stop_transport(), which handles re-entering
itself, so _ms_stopped can simply be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This "state" is the sub-state of the STARTING mode that's activated
when the storage_service::bootstrap() is called. Instead of the
separate boolean the new mode can be used. To stop it from reverting
the BOOTSTRAP mode back to JOINING some calls to set_mode() should
be converted into regular logging.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This bit is hairy. First, it indicates that the storage service
entered the init_server() method. But, once the node is up and
running it also indicates whether the gossiper is enabled or not
via the API call.
To rely on the operation mode, first, the NONE mode is introduced
at which the server starts. Then in init_server() it switches to
STARTING.
Second change is to stop using the bit in enable/disable gossiper
API call, instead -- check the gossiper.is_enabled() itself.
To keep the is_initialized API call compatible, when the operation
mode is NORMAL it would return true/false according to the status
of the gossiper. This change is simple because storage service API
handlers already have the gossiper instance hanging around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The is_joined() status can be obtained with get_operation_mode(). Since
it indicates that the operation mode is JOINING, NORMAL or anything
above, the operation mode enum class should be reordered to allow
a simple >= comparison.
Another needed change is to set the mode a few steps earlier than it
happens now to cover the non-bootstrap startup case.
And the third change is to partially revert the d49aa7ab that made
the .is_joined() method be future-less. Nowadays the is_joined() is
called only from the API which is happy with being future-full in
all other storage service state checks.
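The ordering trick described above can be sketched as follows. The enumerator list is an assumption made for illustration: the real mode enum has more values, and its exact order is not shown in this message.

```c++
#include <cassert>

// Hypothetical, shortened mode list, declared in lifecycle order so that
// relational comparisons are meaningful.
enum class mode { NONE, STARTING, JOINING, NORMAL, DRAINED };

// "Joined" means JOINING, NORMAL or anything declared after them.
constexpr bool is_joined(mode m) {
    return m >= mode::JOINING;
}

static_assert(!is_joined(mode::NONE));
static_assert(!is_joined(mode::STARTING));
static_assert(is_joined(mode::JOINING));
static_assert(is_joined(mode::NORMAL));
static_assert(is_joined(mode::DRAINED));
```

The comparison only stays correct as long as new enumerators are inserted at the right place in the declaration order.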
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is trivial change, since the only user is in API and the
get_operation_mode + mode values are at hand.
One thing to pay attention to -- the new method checks the mode to
be <= STARTING, not for equality. For now this is an equivalent change,
but the next patch will introduce the NONE mode, which should be reported
as is_starting() too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it reports back formatted mode. For future convenience it's
needed to return the raw value, all the more so the mode enum class
is already public.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In several places the call to set_mode(...) is used as a (format-less)
replacement for regular logging. Mode doesn't really change there, because
it had been changed before. Patch all those places to use regular logging,
next patches will make full use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Info messages are logged when compaction jobs start and finish
but there is no message logged when the job is interrupted, e.g.
when stopped by the compaction_manager.
Refs scylladb/scylla-dtest#2468
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a new CDC generation is created (during bootstrap or otherwise), it
is assigned a timestamp. The timestamp must be propagated as soon as
possible, so all live nodes can learn about the generation before their
clocks reach the generation's timestamp. The propagation mechanism for
generation timestamps is gossip.
When bootstrap RBNO was enabled this was not the case: the generation
timestamp was inserted into gossiper state too late, after the repair
phase finished. Fix this.
Also remove an obsolete comment.
Fixes https://github.com/scylladb/scylla/issues/10149.
Closes #10154
* github.com:scylladb/scylla:
service: storage_service: announce new CDC generation immediately with RBNO
service: storage_service: fix indentation
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.
Fixes #10173
Test: mutation_test.test_cell_ordering, unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
Currently a subscripted column is expressed using the struct `column_value`:
```c++
/// A column, optionally subscripted by a value (eg, c1 or c2['abc']).
struct column_value {
const column_definition* col;
std::optional<expression> sub; ///< If present, this LHS is col[sub], otherwise just col.
};
```
It would be better to have a generic AST node for expressing arbitrary subscripted values:
```c++
/// A subscripted value, eg list_column[2], val[sub]
struct subscript {
expression val;
expression sub;
};
```
The `subscript` struct would allow us to express more, for example:
* subscripted `column_identifier`, not only `column_definition` (needed to get rid of `relation` class)
* nested subscripts: `col[1][2]`
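To illustrate why a generic node allows nesting, here is a heavily simplified sketch. The `expression` variant below is hypothetical (a column name, an integer constant, or a subscript) and is not the real `cql3::expr::expression`; `make_subscript` and `to_string` are illustrative helpers:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <variant>

struct subscript;

// Hypothetical expression: the subscript alternative is held by pointer
// because the type is recursive.
using expression = std::variant<std::string, int, std::shared_ptr<subscript>>;

struct subscript {
    expression val; // the subscripted value, possibly itself a subscript
    expression sub; // the index or key
};

expression make_subscript(expression val, expression sub) {
    return std::make_shared<subscript>(subscript{std::move(val), std::move(sub)});
}

std::string to_string(const expression& e) {
    struct visitor {
        std::string operator()(const std::string& col) const { return col; }
        std::string operator()(int v) const { return std::to_string(v); }
        std::string operator()(const std::shared_ptr<subscript>& s) const {
            return to_string(s->val) + "[" + to_string(s->sub) + "]";
        }
    };
    return std::visit(visitor{}, e);
}
```

Since `val` is itself an expression, `col[1][2]` is just a subscript whose `val` is another subscript, something `column_value` with a single optional `sub` cannot represent.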
Adding `subscript` to the `expression` variant immediately would require implementing all `expr::visit` handlers in the same commit, so I took a different approach. At first the struct is just there and the visit handlers are implemented one by one in advance, then at the end `subscript` is added to `expression`. This way all the new code can be neatly divided into commits and everything is still bisectable.
There were a few cases where the existing behaviour seemed to make little sense, but I didn't change it to keep the PR focused on refactoring. I left `FIXME` comments there and I will submit separate patches to fix them.
Closes #10139
* github.com:scylladb/scylla:
cql3: expr: Remove sub from column_value
cql3: Create a subscript in single_column_relation
cql3: expr: Add subscript to expression
cql3: Handle subscript in multi_column_range_accumulator
cql3: Handle subscript in selectable_process_selection
cql3: expr: Handle subscript in test_assignment
cql3: expr: Handle subscript in prepare_expression
cql3: Handle subscript in prepare_selectable
cql3: expr: Handle subscript in extract_clustering_prefix_restrictions
cql3: expr: Handle subscript in extract_partition_range
cql3: expr: Handle subscript in fill_prepare_context
cql3: expr: Handle subscript in evaluate
cql3: expr: Handle subscript in extract_single_column_restrictions_for_column
cql3: expr: Handle subscript in search_and_replace
cql3: expr: Handle subscript in recurse_until
cql3: expr: Implement operator<< for subscript
cql3: expr: Handle subscript in possible_lhs_values
cql3: expr: Handle subscript in is_supported_by
cql3: expr: Handle subscript in is_satisifed_by
cql3: expr: Remove unused attribute
cql3: expr: Use column_maybe_subscripted in is_one_of()
cql3: expr: Use column_maybe_subscripted in limits()
cql3: expr: Use column_maybe_subscripted in equal()
cql3: expr: add get_subscripted_column(column_maybe_subscripted)
cql3: expr: Add as_column_maybe_subscripted
cql3: expr: Make get_value_comparator work with column_maybe_subscripted
cql3: expr: Make get_value work with column_maybe_subscripted
cql3: expr: Add column_maybe_subscripted
cql3: expr: Add get_subscripted_column
cql3: expr: Add subscript struct
When a node starts it does not immediately become a candidate, since it
waits to learn about an already existing leader and randomizes the time at
which it becomes a candidate, to prevent dueling candidates if several nodes
are started simultaneously.
If a cluster consists of only one node there is no point in waiting
before becoming a candidate though, because the two cases above cannot
happen. This patch checks whether the node belongs to a singleton cluster
where the node itself is the only voting member and, if so, becomes a
candidate immediately. This reduces the starting time of single-node
clusters, which are often used in testing.
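A minimal sketch of the singleton check, assuming a hypothetical flattened view of the configuration (`member`, `is_singleton_voter` are illustrative names, not the Raft implementation's types):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a Raft configuration member.
struct member {
    uint64_t id;
    bool can_vote;
};

// True if `self` is the only voting member of the configuration.
bool is_singleton_voter(const std::vector<member>& config, uint64_t self) {
    bool self_votes = false;
    for (const member& m : config) {
        if (!m.can_vote) {
            continue; // non-voters never compete in elections
        }
        if (m.id != self) {
            return false; // another voter exists, so waiting still matters
        }
        self_votes = true;
    }
    return self_votes;
}
```

Only when this returns true is it safe to skip the randomized wait, because no other node can be a leader or a competing candidate.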
Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>
When setting up clusters in regression tests, a bunch of servers were
created, each starting with a singleton configuration containing itself.
This is wrong: servers joining an existing cluster should be started
with an empty configuration.
It 'worked' because the first server, which we wait for to become a leader
before creating the other servers, managed to override the logs and
configurations of other servers before they became leaders in their
configurations.
But if we want to change the logic so that servers in single-server clusters
elect themselves as leaders immediately, things start to break. So fix
the bug.
Message-Id: <20220303100344.6932-1-kbraun@scylladb.com>
Referring to issue #7915, Cassandra also works with an unprepared
statement. The test was missing `fromJson()`, so it was inserting
a string into a boolean column.
The problem was an incompatibility with Cassandra, which accepts a bool
as a string in the `fromJson()` UDF. The remaining difference between
Cassandra and Scylla is that Scylla accepts whitespace around the word in
the string, while Cassandra doesn't. Both are case insensitive.
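The accepted string forms can be sketched like this. `parse_bool_string` is a hypothetical helper written for illustration, not the function used by `fromJson()`:

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper: parse a JSON string as a boolean, ignoring case and
// surrounding whitespace (the Scylla behaviour described above).
std::optional<bool> parse_bool_string(std::string s) {
    auto not_space = [](unsigned char c) { return !std::isspace(c); };
    // Trim leading, then trailing whitespace.
    s.erase(s.begin(), std::find_if(s.begin(), s.end(), not_space));
    s.erase(std::find_if(s.rbegin(), s.rend(), not_space).base(), s.end());
    // Lower-case for a case-insensitive match.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (s == "true") {
        return true;
    }
    if (s == "false") {
        return false;
    }
    return std::nullopt; // not a boolean word
}
```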
Fixes: #7915
If the sstable is marked for deletion, e.g. when
writing the sstable fails for any reason before it's
sealed, make sure to remove the sstable's temporary
directory, if present, besides the sstables files.
This condition is benign as these empty temp dirs
are removed when scylla starts up, but they do accumulate
and we had better remove them too.
Fixes #9522
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302161827.2448980-1-bhalevy@scylladb.com>
This makes host id mismatch cause a warning and stop being fatal,
to un-break node replacement dtests.
Should be revisited if/when the underlying problem (double setting of
local host id on a replacing node) is fixed.
Refs #10148
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttl:s.
The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.
This may be triggered by e.g. the spark migrator, which computes the ttl
from the expiry time by subtracting the current time from the expiry
time.
If the cell is migrated multiple times at different times, it will generate
cells that have the same expiry (by design) but different ttl values.
Fixes #10156
Test: mutation_test.test_cell_ordering, unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Add missing const qualifiers in serialize_to_bytes and
serialize_to_managed_bytes. Lack of those qualifiers caused GCC
compilation error:
./types/map.hh: In instantiation of ‘static bytes map_type_impl::serialize_to_bytes(const Range&) [with Range = std::map<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false>, serialized_compare>; bytes = seastar::basic_sstring<signed char, unsigned int, 31, false>]’:
cql3/type_json.cc:138:45: required from here
./types/map.hh:72:41: error: loop variable ‘elem’ of type ‘const std::pair<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >&’ binds to a temporary constructed from type ‘const std::pair<const seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >’ [-Werror=range-loop-construct]
72 | for (const std::pair<bytes, bytes>& elem : map_range) {
| ^~~~
./types/map.hh:72:41: note: use non-reference type ‘const std::pair<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >’ to make the copy explicit or ‘const std::pair<const seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >&’ to prevent copying
Adding those const qualifiers there is correct, as the definition of
those functions specifies that the range is of
std::pair<const bytes, bytes> elements, not std::pair<bytes, bytes>
(before the change):
requires std::convertible_to<std::ranges::range_value_t<Range>,
std::pair<const bytes, bytes>>
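A small standalone illustration of the same diagnostic, using `std::string` in place of `bytes` (`total_size` is a hypothetical function, unrelated to the serialization code):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// The value_type of std::map<K, V> is std::pair<const K, V>. Binding elements
// to const std::pair<K, V>& would build a temporary copy per element, which is
// what -Wrange-loop-construct reports.
std::size_t total_size(const std::map<std::string, std::string>& m) {
    std::size_t n = 0;
    for (const std::pair<const std::string, std::string>& elem : m) {
        n += elem.first.size() + elem.second.size();
    }
    return n;
}
```

Writing `const auto& elem` would avoid the mismatch as well; spelling the type out keeps the example close to the fix described above.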
Note that there are some GCC compilation problems still left apart
from this one.
Closes #10157
No need to check first that the cells' expiry is different
or that deletion_time is different before comparing them
with `<=>`.
If they are the same the function returns std::strong_ordering::equal
anyhow and that is the same as `<=>` comparing identical values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>
Passing an integer which exceeds the corresponding type's bounds to
`fromJson()` was causing silent overflow, e.g. inserting
`fromJson('2147483648')` into an `int` column stored `-2147483648`.
Now, this will cause marshal_exception. All integer types are tested against their bounds.
Tests referring to issue https://github.com/scylladb/scylla/issues/7914 in `test/cql-pytest/cassandra_tests/validation/entities/json_test.py` won't pass because the expected error messages differ from the thrown ones. I was wondering what the message should be, because the expected messages in the tests aren't consistent, for instance:
- bigint overflow expects `Expected a bigint value, but got a` message
- short overflow expects `Unable to make short from` message
For now the message is `Value {} out of bound`.
Fixes: https://github.com/scylladb/scylla/issues/7914
Closes #10145
* github.com:scylladb/scylla:
CQL3/pytest: Updating test_json
CQL3: fromJson out of range integer cause as error
When compiling utils/rjson.cc on GCC, the compilation triggers the
following warning (which becomes a compilation error):
utils/rjson.cc: In function ‘seastar::future<> rjson::print(const value&, seastar::output_stream<char>&, size_t)’:
utils/rjson.cc:239:15: error: typedef ‘using Ch = char’ locally defined but not used [-Werror=unused-local-typedefs]
239 | using Ch = char;
| ^~
This warning is a false positive. 'using Ch' is actually used internally
by rapidjson::Writer. This is a known GCC bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61596), which has not been
fixed since 2014.
I disabled this warning only locally as other code is not affected by
this warning and no other code already disables this warning.
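One common way to silence a warning for a single function is a diagnostic push/pop pair, sketched below. This assumes pragmas are the mechanism used; the commit text does not say how the warning was disabled, and `local_typedef_example` is a hypothetical function:

```cpp
// Suppress -Wunused-local-typedefs for this function only; the previous
// diagnostic state is restored by the matching pop.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
int local_typedef_example() {
    using Ch = char; // unused here, so GCC would warn about it under -Wall
    return 42;
}
#pragma GCC diagnostic pop
```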
Note that there are some GCC compilation problems still left apart
from this one.
Closes #10158
Also const-ify the db::config reference argument and std::move
the gossip_config argument while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These options need to have updateable_value<> instance referencing
them from gossiper itself. The updateable_value<> is shard-aware in
the sense that it should be constructed on correct shard. This patch
does this -- the db::config reference is carried all the way down
to the gossiper constructor, then each instance gets its shard-local
construction of the updateable_value<>s.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add missing include of "<experimental/source_location>" which caused
compile errors on GCC:
In file included from raft/fsm.hh:12,
from raft/fsm.cc:8:
raft/raft.hh:251:30: error: ‘std::experimental’ has not been declared
251 | state_machine_error(std::experimental::source_location l = std::experimental::source_location::current())
| ^~~~~~~~~~~~
raft/raft.hh:251:59: error: expected ‘)’ before ‘l’
251 | state_machine_error(std::experimental::source_location l = std::experimental::source_location::current())
| ~ ^~
Note that there are some GCC compilation problems still left apart from
this one.
Closes #10155
When a new CDC generation is created (during bootstrap or otherwise), it
is assigned a timestamp. The timestamp must be propagated as soon as
possible, so all live nodes can learn about the generation before their
clocks reach the generation's timestamp. The propagation mechanism for
generation timestamps is gossip.
When bootstrap RBNO was enabled this was not the case: the generation
timestamp was inserted into gossiper state too late, after the repair
phase finished. Fix this.
Also remove an obsolete comment.
Fixes #10149.
Simplify the function by implementing it as a coroutine,
ensuring the input vector, holding the shared task ptrs, is
kept alive throughout the lifetime of the function
(instead of using do_with to achieve that)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-2-bhalevy@scylladb.com>
task_stop is called exclusively from stop_tasks.
Now that stop_tasks calls task::stop() directly,
there is no need for this separation, so open-code
task_stop in stop_tasks, using coroutines.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-1-bhalevy@scylladb.com>
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.
However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.
Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.
The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.
Fixes #10151
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
Passing an integer which exceeds the corresponding type's bounds to
`fromJson()` was causing silent overflow, e.g. inserting
`fromJson('2147483648')` into an `int` column stored `-2147483648`.
Now, this will cause marshal_exception with a value out of bound
message. Also, all integer types are tested against their bounds.
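The bounds check can be sketched as below for signed target types. `checked_narrow` is a hypothetical helper, and `std::out_of_range` merely stands in for Scylla's marshal_exception:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Narrow a parsed 64-bit value to a signed integer type T, throwing instead
// of silently wrapping around.
template <typename T>
T checked_narrow(int64_t v) {
    if (v < std::numeric_limits<T>::min() || v > std::numeric_limits<T>::max()) {
        throw std::out_of_range("Value " + std::to_string(v) + " out of bound");
    }
    return static_cast<T>(v);
}
```

With this check, 2147483648 is rejected for a 32-bit target instead of being stored as -2147483648.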
Fixes: #7914
"
The set puts the code in question into a helper, coroutinizes it, removes
some code duplication, improves a corner case and relaxes logging.
tests: unit(dev), dtest.simple_boot_shutdown(v1, dev)
"
* 'br-join-ring-wait-sanitize-2' of https://github.com/xemul/scylla:
storage_service: De-bloat waiting logs
storage_service: Indentation fix after previous changes
storage_service: Negate loop breaking check
storage_service: Fix off-by-one-second waiting
storage_service: Pack schema waiting loop
storage_service: Out-line schema waiting code
storage_service: Make int delay be std::chrono::milliseconds
First thing is that logging can be done with logger methods,
not with set_mode() because the mode is already set at this
place.
Second thing is that pre-update_pending_ranges logs are excessive,
as the update_pending_ranges logs its progress itself.
Third is that post-logging is also excessive -- there are more
logs after those lines.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In simple words turn
while {
if (continue) {
do_something
} else {
break
}
}
into
while {
if (!continue) {
break;
}
do_something
}
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The waiting loop needs to abort once a minute passes and does
it in one second steps. However, the expiration check happens
after sleep, which effectively throws this last second away.
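A minimal sketch of the corrected ordering, with the expiration check placed before the sleep. `wait_until` and its injected `ready` and `sleep` callbacks are hypothetical, introduced so the sketch can run without real waiting:

```cpp
#include <chrono>
#include <functional>

using namespace std::chrono_literals;

// Poll `ready` in `step` increments for at most `timeout`.
// Returns true if `ready` succeeded within the timeout.
bool wait_until(const std::function<bool()>& ready,
                const std::function<void(std::chrono::milliseconds)>& sleep,
                std::chrono::milliseconds timeout = 60s,
                std::chrono::milliseconds step = 1s) {
    std::chrono::milliseconds waited{0};
    while (!ready()) {
        // Check expiration before sleeping, so the condition is still
        // re-evaluated after the final step instead of being skipped.
        if (waited >= timeout) {
            return false;
        }
        sleep(step);
        waited += step;
    }
    return true;
}
```

With the check after the sleep, a condition that becomes true during the last step would be reported as a timeout; here it is observed by the next `ready()` call.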
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The newly created method looks like this
wait_for_schema_agreement
update_pending_ranges
while (consistent_range_movement) {
pause
wait_for_schema_agreement
update_pending_ranges
}
This patch packs the wait_for_schema_agreement+update_pending_ranges
pairs into a single loop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And coroutinize while moving. No other changes.
While the code in question runs in a thread context and can enjoy
synchronous .get() calls, it's still better if it doesn't make any
assumptions about its environment. The ring joining code is changing
and new intermediate helpers should better be on the safe side from
the very beginning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's milliseconds and is converted back and forth in join_token_ring().
Having a chrono type for it makes things (mostly code reading) simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Recently, coordinator_result was introduced as an alternative for
exceptions. It was placed in the main "exceptions/exceptions.hh" header,
which virtually every single source file in Scylla includes.
But unfortunately, it brings in some heavy header files and templates,
leading to a lot of wasted build time - ClangBuildAnalyzer measured that
we include exceptions.hh in 323 source files, taking almost two seconds
each on average.
In this patch, we split the coordinator_result feature into a separate
header file, "exceptions/coordinator_result.hh", and only the few places
which need it include the header file. Unfortunately, some of these
few places are themselves headers, so the new header file ends up being
included in 100 source files - but 100 is still much less than 323 and
perhaps we can reduce this number later.
After this patch, the total Scylla object-file size is reduced by 6.5%
(the object size is a proxy for build time, which I didn't directly
measure). ClangBuildAnalyzer reports that now each of the 323 includes
of exceptions.hh only takes 80ms, coordinator_result.hh is only included
100 times, and virtually all the cost to include it comes from Boost's
result.hh (400ms per inclusion).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220228204323.1427012-1-nyh@scylladb.com>
Introduced in commit: ddd693c6d7
We're not emplacing newer windows in the tracker, causing
std::out_of_range when replacing sstables for windows.
Let's fix the logic and add a unit test to cover this.
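The failure mode can be illustrated with a hypothetical tracker keyed by window. `window_tracker` is an illustrative class, not the actual compaction strategy state, and the real fix may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical tracker: time-window id -> names of sstables in that window.
class window_tracker {
    std::map<int64_t, std::set<std::string>> _windows;
public:
    // operator[] default-constructs the entry for a window seen for the first
    // time, whereas _windows.at(window) would throw std::out_of_range.
    void add(int64_t window, const std::string& sstable) {
        _windows[window].insert(sstable);
    }
    std::size_t count(int64_t window) const {
        auto it = _windows.find(window);
        return it == _windows.end() ? 0 : it->second.size();
    }
};
```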
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>
The underlying implementation behind the v1 and v2 variants of said
methods is the same, but we want to move to using the v2 variant in the
test as the v1 variant is going away soon.
Currently the output_run_identifier is assigned right
after calling setup_new_compaction.
Move setting the uuid to setup_new_compaction to simplify
the flow.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083643.1845096-1-bhalevy@scylladb.com>
Instead, rely solely on the compaction_data.abort source
that task::stop now uses to stop the task.
This makes task stopping permanent, so it can't be undone
(as used to be the case, where task_stop
set stopping to false after waiting for compaction_done,
to allow rewrite_sstables's task to be created before
calling run_with_compaction_disabled, and start
running after it - which is no longer the case).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083535.1844829-1-bhalevy@scylladb.com>
Currently, rewrite_sstables retrieves the sstables under
run_with_compaction_disabled, *after* it has created a task for itself.
This makes little sense as this task has not started running yet
and therefore does not need to be stopped by
run_with_compaction_disabled.
This is currently worked around by setting task->stopping = false
in task_stop().
This change just moves task creation in rewrite_sstables till
after the sstables are retrieved and the deferred cleanup
of _stats.pending_tasks till after it's first adjusted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083409.1844500-1-bhalevy@scylladb.com>
This series contains:
- lister: move to utils
- tidy up the clutter in the root dir
Based on Avi's feedback to `[PATCH 1/1] utils: directory_lister: close: always abort queue` that was sent to the mailing list:
- directory_lister: drop abort method
- lister: do not require get after close to fail
- test: lister_test: test_directory_lister_close simplify indentation
- cosmetic cleanup
Closes #10142
* github.com:scylladb/scylla:
test: lister_test: test_directory_lister_close simplify indentation
lister: do not require get after close to fail
directory_lister: drop abort method
lister: move to utils
This PR consists of two changes.
The first fixes the flat_mutation_reader and flat_mutation_reader_v2, so that they can be destructed without being closed (if no action has been initiated). This has been discussed in the referenced issue.
The second one changes scanning and flush readers so that they implement the second version of the API.
It also contains unit test fixes, dealing with flat mutation reader assertions (where the v1 asserter failed to consume range tombstones intelligently enough in some flows) and several sstable_3_x tests (where sstables that contain range tombstones were expected to be byte-by-byte equivalent to a reference, aside from semantic validation).
Fixes #9065.
Closes #9669
* github.com:scylladb/scylla:
flat_reader_assertions: do not accumulate out-of-range tombstones
flat_reader_assertions: refactor resetting accumulated tombstone lists
flat_mutation_reader_test: fix "test_flat_mutation_reader_consume_single_partition"
memtable::make_flush_reader(): return flat_mutation_reader_v2
memtable::make_flat_reader(): return flat_mutation_reader_v2
flat_mutation_reader_v2: add consume_partitions()
introduce the MutationConsumer concept
mutation_source: clone shortcut constructors for flat_mutation_reader_v2
flat_mutation_reader_v2: add delegating_reader_v2
memtable: upgrade scanning_reader and flush_reader to v2
flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO.
tests: stop comparing sstables with range tombstones to C* reference
tests: flat_reader_assertions: improve range tombstone checking
Also remove the incorrect difference in range tombstone checking
behavior between `produces_range_tombstone()` and `produces(const
range_tombstone&)` by having both turn on checking.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Since `flat_reader_assertions::produces(const range_tombstone&,...)`
records the range tombstone for checking, be sure to explicitly pass
in a clustering range that does not extend beyond the mock-read part
of the mutation.
Also (provisionally) change the assertion method to accept clustering
ranges.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
This change is part of the effort to migrate existing readers from the old
API to the new one. The corresponding make_flush_reader and
make_flat_reader functions still return flat_mutation_reader.
In functions such as upgrade_to_v2 (excerpt below), if the constructor
of transforming_reader throws, r needs to be destroyed even though it
hasn't been closed. If a reader didn't start any operations, it
is safe to destruct such a reader. This issue can potentially manifest
itself in many more readers and might be hard to track down. This commit
adds a bool indicating whether a close is anticipated, thus avoiding
errors in the destructor.
Code excerpt:
flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) {
class transforming_reader : public flat_mutation_reader_v2::impl {
// ...
};
return make_flat_mutation_reader_v2<transforming_reader>(std::move(r));
}
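The flag can be sketched with a toy reader. All names below are hypothetical and the real readers are asynchronous; the sketch only shows how the first operation arms the destructor check:

```cpp
#include <cassert>

// Hypothetical reader that must be closed before destruction only once it has
// started an operation.
class toy_reader {
    bool _close_required = false;
public:
    void fill_buffer() { _close_required = true; } // first operation arms the check
    void close() { _close_required = false; }
    bool safe_to_destroy() const { return !_close_required; }
    ~toy_reader() {
        assert(safe_to_destroy() && "reader destroyed without close()");
    }
};
```

A reader that was only constructed and moved around, as `r` is when the transforming_reader constructor throws, never arms the check and can be destroyed safely.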
Fixes #9065.
As flat mutation reader {up,down}grades get added to the write path,
comparing range-tombstone-containing (at least) sstables byte-by-byte
to a reference is starting to seem like a fool's errand.
* When a flat mutation reader is {up,down}graded, information may get
lost while splitting range tombstones. Making those splits reversible
should in theory be possible but would surely make {up,down}graders
slower and more complex, and may also possibly entail adding
information to the in-memory representation of range tombstones and
range tombstone changes. Such investment for the sake of 7 unit tests
does not seem wise, given that the plan is to get rid of reader
{up,down}grade logic once the move to flat mutation reader v2 is
completed.
* All affected tests also validate their written sstables
semantically.
* At least some of the offending reference sstables are not
"canonical" wrt range tombstones to begin with -- they contain range
tombstones that overlap with clustering rows. The fact that Scylla
does not "canonicalize" those in some way seems purely incidental.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
`produces_range_tombstone()` is smart enough to not just try to read
one range tombstone from the input and compare it to the passed
reference, but to read as many range tombstones as the reader is
looking at (including none) using `may_produce_tombstones()` and
record those appropriately.
When `produces(const schema&, const mutation_fragment&)` is passed a
range tombstone as the second argument, it does not do anything
special -- it just reads one fragment, disregards it (!), and applies
its second argument to both "expected" and "encountered" range
tombstone lists. The right thing here is to use the same logic as
`produces_range_tombstone()`; upcoming memtable-related reader
changes (which result in more split range tombstones) cause some unit
tests to fail without fixing this.
Refactor the relevant logic into a private method (`apply_rt()`) and
use that in both places.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The header file "exceptions/exceptions.hh" and the exception types in it
are used by virtually every source file in Scylla, so excessive includes
and templated code generation in this header could slow down the build
considerably.
Before this patch, all of the exceptions' constructors were inline in
exceptions.hh, so every source file using one of these exceptions needed
to recompile the code, which is fairly heavy, using the fmt templates
for various types. According to ClangBuildAnalyzer, 323 source files
needed to materialize prepare_message<db::consistency_level,int&,int&>,
taking 0.3 seconds each.
So this patch moves the exception constructors from the header file
exceptions.hh to the source file exceptions.cc. The header file no longer
uses fmt.
Unfortunately, the actual build-time savings from this patch are tiny -
around 0.1%... It turns out that most of the prepare_message<>
compilation time comes from fmt compilation time, and since virtually
all source files use fmt for other reasons (intentionally or
through other headers), no compilation time can be saved. Nevertheless,
I hope that as we proceed with more cleanups like this and eliminate
more unnecessary code-generation-in-headers, we'll start seeing build
time drop.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
Some templates put constraints onto the involved types with the help of
static assertions. Having them in form of concepts is much better.
tests: unit(dev)
"
* 'br-static-assert-to-concept' of https://github.com/xemul/scylla:
sstables: Remove excessive type-match assertions
mutation_reader: Sanitize invocable asserion and concept
code: Convert is_future result_of assertions into invoke_result concept
code: Convert is_same+result_of assertions into invocable concepts
code: Convert nothrow construction assertions into concepts
code: Convert is_integral assertions to concepts
In a previous patch, we added a test for the case of Scylla trying to
assign the JSON value 1e6 into an integer - which should be allowed
because 1e6 is indeed a whole number, in the range of int.
We already fixed that in commit efe7456f0a,
but this patch adds another test which demonstrates that an even more
esoteric problem remains:
If we are reading a JSON value into a bigint (CQL's 64-bit integer),
*and* if the number is between 2^53 and 2^63-1 *and* if the number
is written using scientific notation, e.g., 922337203685477580.7e1
(which is 2^63-1), then the bigint is set incorrectly, with some
digits being lost. The problem is that RapidJSON reads this integer
into the "double" type, which only keeps 53 significant bits.
Because this is an open issue (#10137), the test included here is
marked as expected failure (xfail). The test is also known to
fail in Cassandra - which doesn't allow scientific notation for
JSON integers at all despite the JSON standard - so the test is
also marked "cassandra_bug".
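The 53-bit limit is easy to demonstrate in isolation. `survives_double` is a hypothetical helper and does not involve RapidJSON; it only shows the round trip through `double`:

```cpp
#include <cstdint>

// True if converting v to double and back preserves the value.
// Callers should keep |v| well below 2^63, where the cast back is well-defined.
bool survives_double(int64_t v) {
    double d = static_cast<double>(v);
    return static_cast<int64_t>(d) == v;
}
```

2^53 itself survives the round trip, while 2^53 + 1 has no exact `double` representation and comes back changed, which is the same loss of digits the test exposes for large bigint values.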
Refs #10137
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Rather than using a std::optional<compacting_sstable_registration>
for lazy construction, construct the object early
and call register_compacting when the sstables to register
are available.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than a compaction_manager* so that in the next
patch it could be constructed with just that and
the caller can call register_compacting when
it has the sstables to register ready.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's no need anymore for an indented block
to destroy the directory_lister since the other
sub-case was deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the lister test expected get() to
always fail after close(), but it unexpectedly
succeeded if get() was never called before close,
as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4587/artifact/testlog/x86_64_debug/lister_test.test_directory_lister_close.4001.log
```
random-seed=1475104835
Generated 719 dir entries
Getting 565 dir entries
Closing directory_lister
Getting 0 dir entries
Closing directory_lister
test/boost/lister_test.cc(190): fatal error: in "test_directory_lister_close": exception std::exception expected but not raised
```
This change relaxes this requirement to keep
close() simple, based on Avi's feedback:
> The user should call close(), and not do it while get() is running, and
> that's it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on Avi's feedback:
> We generally have a public abort() only if we depend on an external
> event (like data from a tcp socket) that we don't control. But here
> there are no such external events. So why have a public abort() at all?
If needed in the future, we can consider adding
get(abort_source&) to allow aborting get() via
an external event.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's nothing specific to scylla in the lister
classes, they could (and maybe should) be part of
the seastar library.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Just the factory function itself. The underlying machinery stays v1 for
now. Behind the scenes the v2 variant still invokes the v1 one, with the
necessary conversions.
This allows migrating users to the v2 interface, migrating the machinery
later.
column_value::sub has been replaced by the subscript struct
everywhere, so we can finally remove it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When `val[sub]` is parsed, it used to be the case
that column_value with a sub field was created.
Now this has been changed to creating a subscript struct.
This is the only place where a subscripted value can be created.
All the code regarding subscripts now operates using only the
subscript struct, so we will be able to remove column_value::sub soon.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All handlers for subscript have finally been implemented
and subscript can now be added to expression without
any trouble.
All the commented out code that waited for this moment
can now be uncommented.
Every such piece of code had a `TODO(subscript)` note
and by grepping this phrase we can make sure that
we didn't forget any of them.
Right now there are two ways to express a subscripted
column - either by a column_value with a sub field
or by using a subscript struct.
The grammar still uses the old column_value way,
but column_value.sub will be removed soon
and everything will move to the subscript struct.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
extract_clustering_prefix_restrictions collects restrictions
on clustering key columns.
In case we encounter col[sub] we treat it as a restriction on col
and add it to the result.
This seems to make some sense and is in line with the current behaviour
which doesn't check whether a column is subscripted at all.
The code has been copied from column_value& handler.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
extract_partition_range collects restrictions on partition key columns.
In case we encounter col[sub] we treat it as a restriction on col
and add it to the result.
This seems to make some sense and is in line with the current behaviour
which doesn't check whether a column is subscripted at all.
The code has been copied from column_value& handler.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
fill_prepare_context collects useful information about
the expression involved in query restrictions.
We should collect this information from subscript as well,
just like we do from column_value and its sub.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
extract_single_column_restrictions_for_column finds all restrictions
for a column and puts them in a vector.
In case we encounter col[sub] we treat it as a restriction on col
and add it to the result.
This seems to make some sense and is in line with the current behaviour
which doesn't check whether a column is subscripted at all.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Prepare a handler for subscript in search_and_replace.
Some of the code must be commented out for now
because subscript hasn't been added to expression yet.
It will be uncommented later.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
possible_lhs_values returns set of possible values
for a column given some restrictions.
Current behaviour in case of a subscripted column
is to just ignore the subscript and treat
the restriction as if it were on just the column.
This seems wrong, or at least confusing,
but I won't change it in this patch to preserve the existing behaviour.
Trying to change this to something more reasonable
breaks other code which assumes that possible_lhs_values
returns a list of values.
(See partition_ranges_from_EQs() in cql3/restrictions/statement_restrictions.cc)
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
is_supported_by checks whether the given expression
is supported by some index.
The current behaviour seems wrong, but I kept
it to avoid making changes in a refactor PR.
Scylla doesn't have indexes on map entries yet,
so for a subscript the answer is always no.
I think we should just return false there.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
For the most part subscript can be handled
in the same way as column_value.
column_value has a sub argument and all
called functions evaluate lhs value using
get_value() which is prepared to handle
subscripted columns.
These functions now take column_maybe_subscripted
so we can pass &subscript to them without a problem.
The difference is in CONTAINS, CONTAINS_KEY and LIKE.
contains() and contains_key() throw an exception
when the passed column has a subscript, so now
we just throw an exception immediately.
like() doesn't have a check for subscripted value,
but from reading its code it's clear that
it's not ready to handle such values,
so an exception is now thrown as well.
It shouldn't break any tests because when one tries
to perform a query like:
`select * from t where m[0] like '%' allow filtering;`
an exception is thrown somewhere earlier in the code.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Functions that were previously marked as unused to make the code
compile are now used and we can remove the markings.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
is_one_of() used to take column_value which could be subscripted as an argument.
column_value.sub will be removed so this function needs to take column_maybe_subscripted now.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
limits() used to take column_value which could be subscripted as an argument.
column_value.sub will be removed so this function needs to take column_maybe_subscripted now.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
equal() used to take column_value which could be subscripted as an argument.
column_value.sub will be removed so this function needs to take column_maybe_subscripted now.
To get lhs value the code uses get_value() which is ready to handle subscripted columns.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that extracts the column_value
from column_maybe_subscripted.
There were already overloads for expression and subscript,
but this one will be needed as well.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a convenience function that converts
a reference to expression into column_maybe_subscripted.
It will be useful in a moment.
For now part of it must be commented out
because subscript is not in the expression variant yet.
It will be uncommented once subscript is finally added
to expression.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There is get_value_comparator(column_value) but soon
we will also need get_value_comparator(column_maybe_subscripted).
Implement it by copying code from get_value_comparator(column_value).
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There is a get_value(column_value), but soon we will also
need get_value(column_maybe_subscripted).
Implement get_value(column_maybe_subscripted) by checking
whether the argument is a column_value or subscript
and calling the right code.
Code for handling the subscript case is copied from
get_value(column_value) where sub has value.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
column_maybe_subscripted is a variant that
can be either a column_value or a subscript.
It will be used as an argument to functions
which used to take column_value.
Right now column_value has a sub field,
but this will be removed soon once
the subscript struct takes over.
Changing the argument type is a smaller change
than rewriting all these functions, although
if they were rewritten the resulting code
would probably be nicer.
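A minimal sketch of the idea, assuming a pointer-based std::variant. The
struct layouts and the helper names `column_getter` and `describe` are
simplified stand-ins invented for illustration; only the names
column_value, subscript and column_maybe_subscripted come from this series:

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical, simplified stand-ins for the AST nodes named in the commit.
struct column_value {
    std::string name;
};

struct subscript {
    column_value val;  // the column being subscripted
    int sub;           // the subscript, simplified to an int
};

// Either a plain column or a subscripted one.
using column_maybe_subscripted = std::variant<const column_value*, const subscript*>;

// Visitor that returns the underlying column for both alternatives.
struct column_getter {
    const column_value& operator()(const column_value* c) const {
        assert(c != nullptr);
        return *c;
    }
    const column_value& operator()(const subscript* s) const {
        assert(s != nullptr);
        return s->val;
    }
};

inline const column_value& get_subscripted_column(const column_maybe_subscripted& cms) {
    return std::visit(column_getter{}, cms);
}

// Visitor that renders "col" or "col[sub]" (illustration only).
struct describer {
    std::string operator()(const column_value* c) const { return c->name; }
    std::string operator()(const subscript* s) const {
        return s->val.name + "[" + std::to_string(s->sub) + "]";
    }
};

inline std::string describe(const column_maybe_subscripted& cms) {
    return std::visit(describer{}, cms);
}
```

A function that used to take column_value can take the variant instead and
dispatch with std::visit, so callers may pass either `&column` or `&subscript`.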
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Even though the new subscript allows for subscripting anything,
the only thing that is really allowed to be subscripted is a column.
Add a utility function that extracts the column_value
from an expression which is a column_value or subscript.
It will come in handy in the following commits.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.
The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.
This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind. Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.
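A rough sketch of the classification step. The commit does not list which
errors it treats as environmental, so the errno values below (ENOSPC, EIO,
EACCES) and the function names are assumptions made for illustration:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <exception>
#include <system_error>

// Returns true for errors that are assumed to originate in the system
// rather than in the program. The set of errno values is illustrative.
inline bool is_environmental_error(std::exception_ptr ep) noexcept {
    assert(ep && "expects a captured exception");
    try {
        std::rethrow_exception(ep);
    } catch (const std::system_error& e) {
        switch (e.code().value()) {
        case ENOSPC:
        case EIO:
        case EACCES:
            return true;
        default:
            return false;
        }
    } catch (...) {
        return false;
    }
}

// Hypothetical handler for an error caught during deferred shutdown.
[[noreturn]] inline void handle_shutdown_failure(std::exception_ptr ep) noexcept {
    if (is_environmental_error(ep)) {
        std::_Exit(255);  // exit early, no core dump
    }
    std::abort();         // explicit abort, leaves a core dump
}
```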
Fixes #9573
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
It seems that batch prepared statements always return false for
depends_on. As a result, the removal criterion of the prepared
statements cache is never satisfied, so these queries are never
evicted.
Here we change the function to return the true state, meaning that
it returns true if any of the sub-queries depends on the given
keyspace and/or column family.
Fixes #10129
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
purpose
CQL statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The latter took only a table name as a
parameter, which makes no sense. There could be multiple tables with the same
name each in a different keyspace and it doesn't make sense to
generalize the test - i.e to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one, depends_on, which takes a
keyspace name and optionally a table name. That way, every logical
dependency test that makes sense is supported by a single API call.
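A minimal sketch of the unified call, including the batch case where the
answer is true if any sub-statement depends on the given names. The struct
layouts are hypothetical; only the name depends_on comes from the commit:

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified statement that touches exactly one table.
struct modification_statement {
    std::string keyspace;
    std::string table;

    // With no table name, the question is "does it depend on the keyspace?".
    bool depends_on(std::string_view ks_name,
                    std::optional<std::string_view> cf_name) const {
        assert(!ks_name.empty());
        return keyspace == ks_name && (!cf_name || table == *cf_name);
    }
};

// A batch depends on a keyspace/table if any of its sub-statements does.
struct batch_statement {
    std::vector<modification_statement> statements;

    bool depends_on(std::string_view ks_name,
                    std::optional<std::string_view> cf_name) const {
        return std::any_of(statements.begin(), statements.end(),
                           [&](const modification_statement& s) {
                               return s.depends_on(ks_name, cf_name);
                           });
    }
};
```

Because the keyspace is always required, the ambiguous question "does a
statement depend on any table named XXX?" can no longer be asked.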
Add a struct called subscript, which will be used in expression
variant to represent subscripted values e.g col[x], val[sub].
It will replace the sub field of column_value.
Having a separate struct in AST for this purpose
is cleaner and allows expressing subscripts on
values other than column_value.
It is not added to the expression variant yet, because
that would require immediately implementing all visitors.
The following commits will implement individual visitors
and then subscript will finally be added to expression.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).
The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.
For storage we use a simple local table.
A simple bugfix is also included in the first patch.
* kbr/discovery-persist-v3:
service: raft: raft_group0: persist discovered peers and restore on restart
db: system_keyspace: introduce discovery table
service: raft: discovery: rename `get_output` to `tick`
service: raft: discovery: stop returning peer_list from `request` after becoming leader
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.
The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.
The series also adds two safety checks to LSA to catch such problems earlier.
Fixes #10056
\cc @slivne @bhalevy
Closes #10130
* github.com:scylladb/scylla:
lsa: Abort when trying to free a standard allocator object not allocated through the region
lsa: Abort when _non_lsa_memory_in_use goes negative
tests: utils: cached_file: Validate occupancy after eviction
test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
utils: cached_file: Fix alloc-dealloc mismatch during eviction
Just like scylla-sstable, have separate --help content for each
action. The existing description is shortened and is demoted to summary:
this now only appears in the listing in the main description.
"
Problem statement
=================
Today, compaction can act much more aggressively than it really has to, because
the strategy and its definition of backlog are completely decoupled.
The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the amplification intended by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.
It can be seen today, in many deployments, that compaction shares are either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.
Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.
Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.
With this approach, throughput and latency are improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.
[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ
Results
=======
Test sequentially populates 3 tables and then runs a mixed workload on them,
where disk:memory ratio (usage) reaches ~30:1 at the peak.
Please find graphs here:
https://user-images.githubusercontent.com/1409139/153687219-32368a35-ac63-461b-a362-64dbe8449a00.png
1) Patched version started at ~01:30
2) On population phase, throughput increase and lower P99 write latency can be
clearly observed.
3) On mixed phase, throughput increase and lower P99 write and read latency can
also be clearly observed.
4) Compaction CPU time sometimes reaches ~100% because of the delay between each
loader.
5) On the unpatched version, it can be seen that the backlog keeps growing even
though strategies become satisfied, so compaction is using much more CPU time
in comparison. Patched version correctly clears the backlog.
Can also be found at:
github.com/raphaelsc/scylla.git compaction-controller-v5
tests: UNIT(dev, debug).
"
* 'compaction-controller-v5' of https://github.com/raphaelsc/scylla:
tests: Add compaction controller test
test/lib/sstable_utils: Set bytes_on_disk for fake SSTables
compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component
compaction: Redefine compaction backlog to tame compaction aggressiveness
compaction_backlog_tracker: Batch changes through a new replacement interface
table: Disable backlog tracker when stopping table
compaction_backlog_tracker: make disable() public
compaction_backlog_tracker: Clear tracker state when disabled
compaction: Add normalized backlog metric
compaction: make size_tiered_compaction_strategy static
tri_compare_opt can avoid casting bool to int for spaceshipping
int - int <=> 0 looks nicer and shorter as int <=> int
data_type::compare from serialized_tri_compare already returns strong_ordering
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220224125556.13138-1-xemul@scylladb.com>
This patch adds metrics to the Alternator TTL feature (aka the "expiration
service").
I put these metrics deliberately in their own object in ttl.{hh,cc}, and
also with their own prefix ("expiration_*") - and *not* together with the
rest of the Alternator metrics (alternator/stats.{hh,cc}). This is
because later we may want to use the expiration service not only in
Alternator but also in CQL - to support per-item expiration with CDC
events also in CQL. So the implementation of this feature should not be
too tangled with that of Alternator.
The patch currently adds four metrics, and opens the path to easily add
more in the future. The metrics added now are:
1. scylla_expiration_scan_passes: The number of scan passes over the
entire table. We expect this to grow by 1 every
alternator_ttl_period_in_seconds seconds.
2. scylla_expiration_scan_table: The number of table scans. In each scan
pass, we scan all the tables that have the Alternator TTL feature
enabled. Each scan of each table is counted by this counter.
3. scylla_expiration_items_deleted: Counts the number of items that
the expiration service expired (deleted). Please remember that
each item is considered for expiration - and then expired - on
only one node, so each expired item is counted only once - not
RF times.
4. scylla_expiration_secondary_ranges_scanned: If this counter is
incremented, it means this node took over some other node's
expiration scanning duties while the other node was down.
This patch also includes a couple of unrelated comment fixes.
I tested the new metrics manually - they aren't yet tested by the
Alternator test suite because I couldn't make up my mind if such
tests would belong in test_ttl.py or test_metrics.py :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220224092419.1132655-1-nyh@scylladb.com>
The flush of hints and batchlog are needed only for the table with
tombstone_gc_mode set to repair mode. We should skip the flush if the
tombstone_gc_mode is not repair mode.
Fixes #10004
Closes #10124
This patch adds a reproducing test for issue #10081. That issue is about
a conditional (LWT) UPDATE operation that chose a non-existent row via WHERE,
and its condition refers to both static and regular columns: In that case,
the code incorrectly assumes that because it didn't read any row, all columns
are null - and forgets that the static column is *not* null.
The test, test_lwt.py::test_lwt_missing_row_with_static
passes on Cassandra but fails on Scylla, so is marked xfail.
Refs #10081
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220215215243.660087-1-nyh@scylladb.com>
The background scan for expired Alternator items (the TTL feature)
should bypass the cache to avoid polluting it with the entire content
of the table being scanned.
I tested that the flag added in this patch really works by adding a printout
to the code in table.cc which creates the reader. Although we do have a
metric for uses of BYPASS CACHE, unfortunately this metric counts usage
of "BYPASS CACHE" in CQL statements - and does not account for the
low-level calls that we use in the ttl scanner.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The document docs/alternator/compatibility.md suggested that Alternator
does not support the TTL feature at all. The real situation is more
optimistic - this feature is supported, but as an experimental feature.
So let's update compatibility.md with the real status of this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, the experimental TTL (expiration time) feature in
Alternator scans tables for expiration in a tight loop - starting the
next scan one second after the previous one completed.
In this patch we introduce a new configuration option,
alternator_ttl_period_in_seconds, which determines how frequently
to start the scan. The default is 24 hours - meaning that the next
scan is started 24 hours after the previous one started.
The tests (test/alternator/run) change this configuration back to one
second, so that expiration tests finish as quickly as possible.
Please note that the scan is *not* slowed down to fill this 24 hours -
if it finishes in one hour, it will then sleep for 23 hours. Additional
work would be needed to slow down the scan to not finish too quickly.
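The timing rule above can be sketched as follows. The function name is
hypothetical, and starting the next scan immediately when a scan overruns
the period is an assumption that the commit message does not state:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// How long to sleep after a scan so that the next scan starts `period`
// after the previous one *started* (not after it finished).
// Assumption: if the scan took longer than the period, do not sleep at all.
inline std::chrono::seconds sleep_after_scan(std::chrono::seconds period,
                                             std::chrono::seconds scan_duration) {
    assert(period >= std::chrono::seconds::zero());
    assert(scan_duration >= std::chrono::seconds::zero());
    return std::max(period - scan_duration, std::chrono::seconds::zero());
}
```

With the default period of 24 hours, a scan that takes one hour is followed
by a 23 hour sleep; the scan itself is not slowed down.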
One idea not yet implemented is to move the expiration service from
the "maintenance" scheduling group which it uses today to a new
scheduling group, and modifying the number of shares that this group
gets.
Another thing worth noting about the configurable period (which defaults
to 24 hours) is that when TTL is enabled on an Alternator table, it can
take that amount of time until its scan starts and items start expiring
from it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
lsa: Abort when trying to free a standard allocator object not
allocated through the region
It indicates an alloc-dealloc mismatch, and can cause other problems in
the system, such as being unable to reclaim memory. We want to catch this at
the deallocation site to be able to quickly identify the offender.
Misbehavior of this sort can cause fake OOMs due to underflow of
_non_lsa_memory_in_use. When it underflows enough,
shard_segment_pool.total_memory() will become 0 and memory reclamation
will stop doing anything.
Refs #10056
There's no automated test for controller, it's time to have one.
Let's start with a basic one that verifies the assumption that
perfectly compacted tiers should produce 0 backlog.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Not precise, as bytes_on_disk accounts for all components, but good enough
for testing purposes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction can act much more aggressively than it really has to, because
the strategy and its definition of backlog are completely decoupled.
The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the amplification intended by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.
It can be seen today, in many deployments, that compaction shares are either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.
Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.
Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.
With this approach, throughput and latency are improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.
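As a toy illustration only: the model below is not the backlog formula used
by the size-tiered tracker. It only shows the coupling described above, where
a tier that the strategy would not compact contributes nothing. The names
`tier`, `toy_backlog` and `min_files_to_compact` are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// A tier is modelled as the list of its SSTable sizes in bytes.
using tier = std::vector<std::uint64_t>;

// Toy backlog: only tiers holding enough files to be worth compacting
// contribute their bytes. Satisfied tiers contribute zero.
inline std::uint64_t toy_backlog(const std::vector<tier>& tiers,
                                 std::size_t min_files_to_compact) {
    assert(min_files_to_compact >= 2);
    std::uint64_t backlog = 0;
    for (const tier& t : tiers) {
        if (t.size() < min_files_to_compact) {
            continue;  // the strategy is satisfied with this tier
        }
        backlog += std::accumulate(t.begin(), t.end(), std::uint64_t{0});
    }
    return backlog;
}
```

In this model, a table with a single sstable per tier reports zero backlog,
so it would no longer inflate the shares of unrelated compaction jobs.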
[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ
Fixes #4588.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new interface allows table to communicate multiple changes in the
SSTable set with a single call, which is useful on compaction completion
for example.
With this new interface, the size tiered backlog tracker will be able to
know when compaction completed, which will allow it to recompute tiers
and their backlog contribution, if any. Without it, tiered tracker
would have to recompute tiers for every change, which would be terribly
expensive.
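A small sketch of why batching helps, assuming a tracker that recomputes its
tiers once per call. The class and the method name `replace_sstables` are
hypothetical, and sstables are reduced to their names:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy tracker that counts how often it has to recompute its tiers.
class toy_backlog_tracker {
    std::set<std::string> _sstables;
    unsigned _tier_recomputations = 0;
public:
    // One call communicates a whole change in the SSTable set, for example
    // the inputs and outputs of a completed compaction.
    void replace_sstables(const std::vector<std::string>& old_ssts,
                          const std::vector<std::string>& new_ssts) {
        for (const auto& name : old_ssts) {
            [[maybe_unused]] auto erased = _sstables.erase(name);
            assert(erased == 1 && "removed sstable must have been tracked");
        }
        _sstables.insert(new_ssts.begin(), new_ssts.end());
        ++_tier_recomputations;  // once per batch, not once per sstable
    }

    std::size_t size() const { return _sstables.size(); }
    unsigned tier_recomputations() const { return _tier_recomputations; }
};
```

With separate remove/add calls, compacting four sstables into one would
trigger five recomputations in this model instead of one.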
The old remove/add interfaces are being removed in favor of the new one.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The primitive_consumer method templates overcomplicate the
declaration of the fact that one of the method arguments is
the sub-type of a template argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are both in the filtering_reader template, leave only
the concept and convert it into one-line invocable check
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Backlog tracker is managed by compaction strategy, and we'd like to
have it disabled in table::stop(), to make sure that all state is
cleared. For example, a reference to a shared sstable, in the
tracker implementation, could prevent the sstable manager from being
stopped as it relies on all sstables managed by it being closed
first. By calling tracker's disable() method, table::stop() will
guarantee that state is cleared by completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If the tracker is disabled, we never get to access the underlying
implementation anymore. It makes sense to clear _impl on
disable(). So table::stop() can call its backlog tracker's disable
method, clearing all its state. This is important for clean
shutdown, as any sstable in tracker state may cause sstable
manager to hang when being stopped.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Normalized backlog metric is important for understanding the controller
behavior as the controller acts on normalized backlog for yielding an
output, not the raw backlog value in bytes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.
The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.
Fixes #10056
More and more places are using the repair[uuid]: format for logging
repair jobs with the uuid. Convert more places to use the new format to
unify the log format.
This makes it easier to grep a specific repair job in the log.
Closes #10125
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.
Memtables are also used outside the replica, in two places:
- in some virtual tables; this is also in some way inside the replica,
(virtual readers are installed at the replica level, not the
coordinator), so I don't consider it a layering violation
- in many sstable unit tests, as a convenient way to create sstables
with known input. This is a layering violation.
We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).
Test: unit (dev)
Closes #10120
This PR changes the read coordinator logic so that read timeout and read failure exceptions are propagated without throwing on the coordinator side.
This PR is only concerned with exceptions which were originally thrown by the coordinator (in read resolvers). Exceptions propagated through RPC and RPC timeouts will still throw, although those exceptions will be caught and converted into exceptions-as-values by read resolvers.
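A minimal sketch of the exceptions-as-values idea, using std::variant as a stand-in. The `result` class, `read_row` and the local `read_timeout_exception` type below are hypothetical simplifications and not the result type or utilities used in this PR:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Stand-in for the real exception type.
struct read_timeout_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Holds either a value or a captured exception.
template <typename T>
class result {
    std::variant<T, std::exception_ptr> _v;
public:
    result(T value) : _v(std::move(value)) {}
    result(std::exception_ptr ep) : _v(std::move(ep)) {}

    bool has_value() const { return _v.index() == 0; }
    const T& value() const {
        assert(has_value());
        return std::get<0>(_v);
    }
    std::exception_ptr error() const {
        assert(!has_value());
        return std::get<1>(_v);
    }
};

// The failure is returned to the caller as a value instead of being thrown
// from this function.
inline result<std::string> read_row(bool replica_responded) {
    if (!replica_responded) {
        return std::make_exception_ptr(read_timeout_exception("read timed out"));
    }
    return std::string("row data");
}
```

Callers check `has_value()` and forward the error unchanged, so the failure path avoids a throw and catch at every layer it passes through.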
This is a continuation of work started in #10014.
Results of `perf_simple_query --smp 1 --operations-per-shard 1000000` (read workload), compared with merge base (10880fb0a7):
```
BEFORE:
125085.13 tps ( 80.2 allocs/op, 12.2 tasks/op, 49010 insns/op)
125645.88 tps ( 80.2 allocs/op, 12.2 tasks/op, 49008 insns/op)
126148.85 tps ( 80.2 allocs/op, 12.2 tasks/op, 49005 insns/op)
126044.40 tps ( 80.2 allocs/op, 12.2 tasks/op, 49005 insns/op)
125799.75 tps ( 80.2 allocs/op, 12.2 tasks/op, 49003 insns/op)
AFTER:
127557.21 tps ( 80.2 allocs/op, 12.2 tasks/op, 49197 insns/op)
127835.98 tps ( 80.2 allocs/op, 12.2 tasks/op, 49198 insns/op)
127749.81 tps ( 80.2 allocs/op, 12.2 tasks/op, 49202 insns/op)
128941.17 tps ( 80.2 allocs/op, 12.2 tasks/op, 49192 insns/op)
129276.15 tps ( 80.2 allocs/op, 12.2 tasks/op, 49182 insns/op)
```
The PR does not introduce additional allocations on the read happy-path. The number of instructions used grows by about 200 insns/op. The increase in TPS is probably just a measurement error.
Closes #10092
* github.com:scylladb/scylla:
indexed_table_select_statement: return some exceptions as exception messages
result_combinators: add result_wrap_unpack
select_statement: return exceptions as errors in execute_without_checking_exception_message
select_statement: return exceptions without throwing in do_execute
select_statement: implement execute_without_checking_exception_message
select_statement: introduce helpers for working with failed results
query_pager: resultify relevant methods
storage_proxy: resultify (do_)query
storage_proxy: resultify query_singular
storage_proxy: propagate failed results through query_partition_key_range
storage_proxy: resultify query_partition_key_range_concurrent
storage_proxy: modify handle_read_error to also handle exception containers
abstract_read_executor: return result from execute()
abstract_read_executor: return and handle result from has_cl()
storage_proxy: resultify handling errors from read-repair
abstract_read_executor::reconcile: resultise handling of data_resolver->done()
abstract_read_executor::execute: resultify handling of data_resolver->done()
result_combinators: add result_discard_value
abstract_read_executor: resultify _result_promise
abstract_read_executor: return result from done()
abstract_read_resolver: fail promises by passing exception as value
abstract_read_resolver: resultify promises
exceptions: make it possible to return read_{timeout,failure}_exception as value
result_try: add as_inner/clone_inner to handle types
result_try: relax ConvertWithTo constraint
exception_container: switch impl to std::shared_ptr and make copyable
result_loop: add result_repeat
result_loop: add result_do_until
result_loop: add result_map_reduce
utils/result: add utilities for checking/creating rebindable results
Refs #10087
Add validation of all params for the keyspace_scrub api.
The validation method is generic and should be used by all apis eventually,
but I'm leaving that as follow-up work.
While at it, fixed the exception types thrown on invalid `scrub_mode` or `quarantine_mode` values from `std::runtime_error` to `httpd::bad_param_exception` so as to generate the `bad_request` http status.
And added unit tests to verify that, and the handling of an unknown parameter.
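A rough sketch of a generic check for unknown parameters. `bad_param_error` is a local stand-in for `httpd::bad_param_exception`, and the function names and the parameter map type are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Local stand-in for the exception that maps to the bad_request status.
struct bad_param_error : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

using params = std::map<std::string, std::string>;

// Rejects any query parameter that the API does not know about.
inline void validate_params(const params& query, const std::set<std::string>& allowed) {
    for (const auto& kv : query) {
        if (allowed.count(kv.first) == 0) {
            throw bad_param_error("unknown parameter: " + kv.first);
        }
    }
}

inline bool is_accepted(const params& query, const std::set<std::string>& allowed) {
    try {
        validate_params(query, allowed);
        return true;
    } catch (const bad_param_error&) {
        return false;
    }
}

// Usage example.
inline void validate_params_example() {
    const std::set<std::string> allowed = {"scrub_mode", "quarantine_mode"};
    const params good = {{"scrub_mode", "some_mode"}};
    const params bad = {{"no_such_param", "1"}};
    assert(is_accepted(good, allowed));
    assert(!is_accepted(bad, allowed));
}
```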
Test: unit(dev)
DTest: nodetool_additional_test.py::TestNodetool::{test_scrub_with_one_node_expect_data_loss,test_scrub_with_multi_nodes_expect_data_rebuild,test_scrub_sstable_with_invalid_fragment,test_scrub_ks_sstable_with_invalid_fragment,test_scrub_segregate_sstable_with_invalid_fragment,test_scrub_segregate_ks_sstable_with_invalid_fragment}
Closes #10090
* github.com:scylladb/scylla:
api: storage_service: scrub: validate parameters
api: storage_service: refactor parse_tables
api: storage_service: refactor validate_keyspace
test: rest_api: add test_storage_service_keyspace_scrub tests
api: storage_service: scrub: throw httpd::bad_param_exception for invalid param values
In CQL table names must be composed only of letters, digits, or underscores,
but some Cassandra documentation is unclear whether these "letters" refer only
to the Latin alphabet, or maybe UTF-8 names composed of letters in other
alphabets should be allowed too.
This patch adds a test that confirms that both Scylla and Cassandra only
accept the Latin alphabet in table names, and for example UTF-8 names
with French or Hebrew letters are rejected.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220222134220.972413-1-nyh@scylladb.com>
Add support for specifying integers in scientific format (for example 1.234e8) in INSERT JSON statement:
```
INSERT INTO table JSON '{"int_column": 1e7}';
```
Before the JSON parsing library was switched to RapidJSON from JsonCpp, this statement used to work correctly, because JsonCpp transparently casts double to integer value.
Inserting a floating-point number ending with .0 is allowed, as the fractional part is zero. Non-zero fractional part (for example 12.34) is disallowed. A new test is added to test all those behaviors.
This behavior differs from Cassandra, which disallows those types of numbers (1e7, 123.0 and 12.34). However, some users rely on that behavior, and the JSON specification itself does not distinguish between floating-point and integer numbers (only a single "number" type is defined).
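The acceptance rule described above can be sketched as follows. This is only an illustration, not the actual type_json code: the helper name `json_double_to_int32` is hypothetical, and it assumes the JSON number has already been parsed into a double and that the target column is a 32-bit int.

```c++
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not Scylla's actual implementation.
// Returns the integer denoted by a JSON number that was parsed as a double,
// or std::nullopt if the number has a non-zero fractional part, is not
// finite, or does not fit in a 32-bit int.
std::optional<int32_t> json_double_to_int32(double d) {
    if (!std::isfinite(d)) {
        return std::nullopt;
    }
    if (std::trunc(d) != d) {
        return std::nullopt; // e.g. 12.34: non-zero fractional part
    }
    if (d < std::numeric_limits<int32_t>::min() ||
        d > std::numeric_limits<int32_t>::max()) {
        return std::nullopt; // whole number, but out of range for int
    }
    return static_cast<int32_t>(d);
}
```

Under these assumptions 1e7 and 123.0 are accepted as whole numbers, while 12.34 is rejected.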
This PR also fixes two minor issues I noticed while looking at the code: wrong blob validation and missing `IsString()` checks that could result in assertion error.
Fixes #10100
Fixes #10114
Fixes #10115
Closes #10101
* github.com:scylladb/scylla:
type_json: support integers in scientific format
type_json: add missing IsString() checks
type_json: fix wrong blob JSON validation
"
The table lists connected clients. For this the clients are
stored in real table when they connect, update their statuses
when needed and remove^w tombstone themselves when they
disconnect. On start the whole table is cleared.
This looks weird. Here's another approach (inspired by the
hackathon project) that makes this table a pure virtual one.
The schema is preserved so is the data returned.
The benefits of doing it virtual are
- no on-disk updates while processing clients
- no potentially failing updates on non-failing disconnect
- less usage of the global qctx thing
- less calls to global storage_proxy
- simpler support for thrift and alternator clients (today's
table implementation doesn't track them)
- the need to make virtual tables reg/unreg dynamic
branch: https://github.com/xemul/scylla/tree/br-clients-virtual-table-4
tests: manual(dev), unit(dev)
The manual test used 80-shards node and 1M connections from
1k different IP addresses.
"
* 'br-clients-virtual-table-4' of https://github.com/xemul/scylla:
test: Add cql-pytest sanity test for system.clients table
client_data: Sanitize connection_notifier
transport: Indentation fix after previous patch
code: Remove old on-disk version of system.clients table
system_keyspace: Add clients_v virtual table
protocol_server: Add get_client_data call
transport: Track client state for real
transport: Add stringifiers to client_data class
generic_server: Gentle iterator
generic_server: Type alias
docs: Add system.clients description
Adjusts the indexed_table_select_statement so that it uses the
result-aware methods in storage_proxy and propagates failed results as
result_message::exception.
Adds a helper combinator utils::result_wrap_unpack which, in contrast to
utils::result_wrap, uses futurize_apply instead of futurize_invoke to
call the wrapped callable.
In short, if utils::result_wrap is used to adapt code like this:
f.then([] {})
->
f_result.then(utils::result_wrap([] {}))
Then utils::result_wrap_unpack works for the following case:
f.then_unpack([] (arg1, arg2) {})
->
f_result.then(utils::result_wrap_unpack([] (arg1, arg2) {}))
Modifies the remaining logic of execute_without... (apart from the
do_execute call) so that the result-aware versions of storage_proxy's
methods are called and failed results are converted to
result_message::exception.
The select_statement will be able to propagate coordinator failures
without throwing, so it's important to override the default
implementations of execute and execute_without... so that the first
calls the latter and not the other way around.
Adds:
- Includes for result-related helper methods (to be used in later
commits),
- Alias for coordinator_result,
- The wrap_result_to_error_message function - a bit similar to
utils::result_wrap. Adapts a callable T -> shared_ptr<result_message>
to take result<T> -> shared_ptr<result_message>. If the result is
failed, it converts it into result_message::exception and returns.
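The adaptation performed by wrap_result_to_error_message can be sketched roughly as below. The types here are stand-ins and not the real ones: `message` plays the role of result_message, and `result<T>` is modelled as a std::variant holding either a value or an error string, which is an assumption made only to keep the sketch self-contained.

```c++
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for result_message.
struct message {
    bool is_error;
    std::string text;
};

// Hypothetical stand-in for result<T>: either a value or an error description.
template <typename T>
using result = std::variant<T, std::string>;

// Adapts a callable T -> shared_ptr<message> into one taking result<T>.
// A failed result is converted into an error message without calling f.
template <typename F>
auto wrap_result_to_error_message(F f) {
    return [f = std::move(f)](auto r) -> std::shared_ptr<message> {
        if (auto* err = std::get_if<std::string>(&r)) {
            return std::make_shared<message>(message{true, *err});
        }
        return f(std::get<0>(std::move(r)));
    };
}
```

The point of the design is that the wrapped callable only ever sees successful values; the failure path never throws.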
Adjusts do_query so that it propagates and returns failed results. The
query_result method is added which is result-aware, and the old query
method was changed to call query_result.
Now, the logic of handling exceptions returned in reconcile() from
data_resolver->done() was changed so that the failed result does not
need to be converted to an exceptional future.
Now, the logic of handling exceptions returned in execute() from
data_resolver->done() was changed so that the failed result does not
need to be converted to an exceptional future.
Adds a utils::result_discard_value, which is an alternative to
future::discard_result which just ignores the "success" value of the
provided result and does not ignore the exception.
Adds read_timeout_exception and read_failure_exception to the list of
exceptions supported by the coordinator_exception_container.
Those exceptions are not yet returned-as-value anywhere, but they will
be in the commits that follow.
Adds two methods to result_try's exception handles:
- as_inner: returns a {l,r}-value reference either to the exception
container, or the exception_ptr. This allows using them in operations
which work on both types, e.g. logging.
- clone_inner: returns a copy of the underlying exception container or
exception ptr.
Currently, the catch handlers in result_futurize_try are required to
return a future, although they are always being called with
seastar::futurize_invoke, so if their result is not future it could be
converted to one anyway. This commit relaxes the ConvertsWithTo
constraint in order to allow this conversion.
The exception_container is supposed to be a cheaper, but possibly harder
to use alternative to std::exception_ptr. Before this commit, the
exception was kept behind foreign_ptr<std::unique_ptr<>> so that moving
the container is very cheap. However, the original std::exception_ptr
supports copying in a thread-safe manner, and it turns out that some of
the read coordinator logic intentionally copies the pointer in order to
be able to fail two different promises with the same exception.
The pointer type is changed to std::shared_ptr. Although it uses atomics
for reference counting, this is also probably what std::exception_ptr
does, so the performance should not be worse. The exception stored
inside the container is immutable, so this allows for a non-throwing
implementation of copying.
To encourage moves instead of copying, the copy constructor is deleted
and instead the `clone()` method should be used if it is really
necessary.
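A minimal sketch of this design, assuming a container that holds a single exception type (the real exception_container supports several): the class name `error_container` and its members are hypothetical and only illustrate the shared_ptr storage, the deleted copy constructor and the explicit clone().

```c++
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>

// Simplified, hypothetical container; not the real exception_container.
class error_container {
    // The stored exception is immutable, so sharing it between copies is safe.
    std::shared_ptr<const std::runtime_error> _ex;

    explicit error_container(std::shared_ptr<const std::runtime_error> ex) noexcept
        : _ex(std::move(ex)) {}
public:
    explicit error_container(std::runtime_error ex)
        : _ex(std::make_shared<std::runtime_error>(std::move(ex))) {}

    // Moves are cheap and are the preferred way to pass the container around.
    error_container(error_container&&) noexcept = default;
    error_container& operator=(error_container&&) noexcept = default;

    // Implicit copies are disabled to encourage moves.
    error_container(const error_container&) = delete;
    error_container& operator=(const error_container&) = delete;

    // Explicit copy: only bumps the reference count, so it cannot throw.
    error_container clone() const noexcept { return error_container(_ex); }

    const char* what() const noexcept { return _ex ? _ex->what() : ""; }
};

static_assert(!std::is_copy_constructible_v<error_container>);
static_assert(std::is_nothrow_move_constructible_v<error_container>);
```

With this shape, failing two promises with the same exception amounts to moving the container into one and passing clone() to the other.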
Adds a result-aware counterpart to seastar::repeat. The new function
is not based on seastar::repeat, but rather is a rewrite of the
original (using a coroutine instead of an open-coded task). The main
consequence of using a coroutine is that exceptions from AsyncAction
need to be thrown once more.
Adds a result-aware counterpart to seastar::do_until. The new function
is not based on seastar::do_until, but rather is a rewrite of the
original (using a coroutine instead of an open-coded task). The main
consequence of using a coroutine is that exceptions from StopCondition
or AsyncAction need to be thrown once more.
Adds result-aware counterparts to all seastar::map_reduce overloads.
Fortunately, it was possible to implement the functions by basing them
on seastar::map_reduce and get the same number of allocations. The only
exception happens when reducer::get() returns a non-ready future, which
doesn't seem to happen on the read coordinator path.
Adds:
- ResultRebindableTo<L, R>: concept which is satisfied by a pair of
results which do not necessarily share the same value, but have the
same error and policy types; a failed result L can be converted to a
failed result R.
- rebind_result<T, R>: given a value type T and another result R,
returns a result which can hold T as value and both the same error and
policy as R.
Add support for specifying integers in scientific format (for example
1.234e8) in INSERT JSON statement:
INSERT INTO table JSON '{"int_column": 1e7}';
Inserting a floating-point number ending with .0 is allowed, as
the fractional part is zero. Non-zero fractional part (for example
12.34) is disallowed. A new test is added to test all those behaviors.
Before the JSON parsing library was switched to RapidJSON from JsonCpp,
this statement used to work correctly, because JsonCpp transparently
casts double to integer value.
This behavior differs from Cassandra, which disallows those types of
numbers (1e7, 123.0 and 12.34).
Fix typo in if condition: "if (value.GetUint64())" to
"if (value.IsUint64())".
Fixes #10100
timestamped_val (and two other type aliases) are nested inside loading_cache,
but indented as if they were top-level names. Adjust the indent to
avoid confusion.
Closes #10118
Use exponential_backoff_retry::retry(abort_source&)
when sleeping between retries and request abort
when the task is stopped.
Fixes #10112
Test: unit(dev)
Closes #10113
* github.com:scylladb/scylla:
compaction_manager: allow stopping sleeping tasks
compaction_manager: task: add make_compaction_stopped_exception
compaction_manager: task: refactor stop
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underlying file
input_stream to the range of the page.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.
For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.
This patch adds a index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.
Fixes #2388
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Use exponential_backoff_retry::retry(abort_source&)
when sleeping between retries and request abort
when the task is stopped.
Fixes #10112
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide a function to make a sstables::compaction_stopped_exception
based on the information in the stopped task.
To be reused by the next patch that will
also throw this exception from the retry sleep path,
when the task is stopped.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add missing IsString() checks to parsing date, time, uuid and inet
types by introducing validated_to_string_view function which checks
whether the value is of string type and otherwise throws
marshal_exception.
Without this check, a malformed input to those types would result in
nasty ServerError with RapidJSON assertion instead of marshal_exception
with detail about the problem.
Add new tests checking passing non-string values for those types.
Fixes #10115
Fixes wrong condition for validating whether a JSON string representing
blob value is valid. Previously, strings such as "6" or "0392fa" would
pass the validation, even though they are too short or don't start with
"0x". Add those test cases to json_cql_query_test.cc.
Fixes #10114
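A check of this kind could look roughly like the sketch below. It is not the code from type_json: the function name is hypothetical, and requiring an even number of hex digits after the "0x" prefix (whole bytes) is an assumption of this sketch rather than something stated above.

```c++
#include <cassert>
#include <cctype>
#include <string_view>

// Hypothetical validator for a JSON string representing a blob value.
// Requires the "0x" prefix followed by hex digits only.
bool looks_like_blob_literal(std::string_view s) {
    if (s.size() < 2 || s[0] != '0' || s[1] != 'x') {
        return false; // too short, or does not start with "0x"
    }
    s.remove_prefix(2);
    if (s.size() % 2 != 0) {
        return false; // assumption: every byte is written as two hex digits
    }
    for (unsigned char c : s) {
        if (!std::isxdigit(c)) {
            return false;
        }
    }
    return true;
}
```

With this rule both "6" and "0392fa" are rejected, while "0x0392fa" is accepted.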
This patch adds two tests for two interesting edge cases in the behavior
of static columns in Scylla. We already have a lot of tests for static
columns in other frameworks (C++ unit tests, cql and dtest), but the two
cases here are issues where specifically we weren't sure how Cassandra
behaves in those cases - and this can most easily be checked in the
test/cql-pytest framework.
The first test, test_static_not_selected, is a reproducer for issue #10091.
This issue was reported by a user @aohotnik, who was surprised by the
fact that Scylla returns empty values, instead of nothing, when selecting
regular columns of a non-existent row if the partition has a static
column set. The test demonstrates a difference between Scylla and
Cassandra, so it is marked "xfail" - it passes on Cassandra and fails on
Scylla. If later we decide that both Scylla's and Cassandra's behaviours
are reasonable and both can be considered "correct", we can change this
test to accept Scylla's result as well and it will begin to pass.
The second test, test_missing_row_with_static, shows that SELECT of a
non-existent row returns nothing - even if the partition has a static
column. The behavior in this case is identical in Scylla and Cassandra,
so this test passes. This contrasts with the analogous situation in LWT
UPDATE from issue #10081, where the IF condition is expected to see the
static column value.
Refs #10081
Refs #10091
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220220120418.831540-1-nyh@scylladb.com>
"
Currently the mutation compactor supports v1 and v2 output and has a v1
output. The next step is to add a v2 output but this would lead to a
full conversion matrix which we want to avoid. So in preparation we drop
the v1 input support. Most inputs were already v2, but there were some
notable exceptions: tests, the compacting reader and the multishard
query code. The former two were simple mechanical updates, but the latter
required some further work because it turned out the v2 version of
evictable reader wasn't used yet and thus it managed to hide some bugs
and dropped features. While at it, we migrate all evictable and
multishard reader users to the v2 variant of the respective readers and
drop the v1 variant completely.
With this the road is open to a v2 compactor output and therefore to a
v2 sstable writer.
Tests: unit(dev, release), dtest(paging_additional_test.py)
"
* 'compact-mutation-v2-only-input/v5' of https://github.com/denesb/scylla:
test/lib/test_utils: return OK from check() variants
repair/row_level: use evictable reader v2
db/view/view_updating_consumer: migrate to v2
test/boost/mutation_reader_test: add v2 specific evictable reader tests
test: migrate to evictable reader v2 and multishard combining reader v2
compact_mutation: drop support for v1 input
test: pass v2 input to mutation_compaction
test/boost/mutation_test: simplify test_compaction_data_stream_split test
mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones
multishard_mutation_query: migrate to v2
mutation_fragment_v2: range_tombstone_change: add memory_usage()
evictable_reader_v2: terminate active range tombstones on reader recreation
evictable_reader_v2: restore handling of non-monotonically increasing positions
evictable_reader_v2: simplify handling of reader recreation
mutation: counter_write_query: use v2 reader
mutation: migrate consume() to v2
mutation_fragment_v2,flat_mutation_reader_v2: mirror v1 concept organization
mutation_reader: compacting_reader: require a v2 input reader
db/view/view_builder: use v2 reader
test/lib/flat_mutation_reader_assertions: adjust has_monotonic_positions() to v2 spec
The various require() and check() methods in test_utils.hh were
introduced to replace BOOST_REQUIRE() and BOOST_CHECK() respectively in
multi-shard concurrent tests, specifically those in
tests/boost/multishard_mutation_query_test.cc.
This was done literally, just replacing BOOST_REQUIRE() with require()
and BOOST_CHECK() with check(). The problem is that check() is missing a
feature BOOST_CHECK() had: while BOOST_CHECK() doesn't cause an
immediate test failure, just logging an error if the condition fails, it
remembers this failure and will fail the test in the end. check() did
not have this feature and this caused potential errors to just be logged
while the test could still pass fine, causing false-positive test
passes. This patch fixes this by returning a [[nodiscard]] bool from the
check() methods. The caller can & these together over all calls to
check() methods and manually fail the test in the end. We choose this
method over a hidden global (like BOOST_CHECK() does) for simplicity's
sake.
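The calling pattern this enables can be sketched as follows. The `check()` below is a hypothetical stand-in for the helpers in test_utils.hh (the real ones take different arguments), and `run_checks()` is just an example caller.

```c++
#include <cassert>
#include <iostream>
#include <string_view>

// Hypothetical stand-in for check(): logs a failure, does not abort,
// and reports the outcome to the caller.
[[nodiscard]] bool check(bool condition, std::string_view what) {
    if (!condition) {
        std::cerr << "check failed: " << what << '\n';
    }
    return condition;
}

// Runs all checks even if an early one fails, and accumulates the outcome
// so the caller can fail the test once at the end.
bool run_checks(int a, int b) {
    bool ok = true;
    ok &= check(a > 0, "a > 0");
    ok &= check(b > 0, "b > 0");
    ok &= check(a < b, "a < b");
    return ok;
}
```

Because of [[nodiscard]], a caller that forgets to accumulate the result gets a compiler warning instead of a silently passing test.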
Not a completely mechanical transition. The consumer has to generate its
mutation via a mutation_rebuilder_v2 as mutation fragment v2 cannot be
applied to mutations directly yet.
One is a reincarnation of the recently removed
test_multishard_combining_reader_non_strictly_monotonic_positions. The
latter was actually targeting the evictable reader but through the
multishard reader, probably for historic reasons (evictable reader was
part of the multishard reader family).
The other one checks that active range tombstone changes are properly
terminated when the partition ends abruptly after recreating the reader.
This test has very elaborate infrastructure essentially duplicating
mutation, mutation::apply() and mutation::operator==. Drop all this
extra code and use mutations directly instead. This makes migrating the
test to v2 easier.
The comment on the public methods calling said method promises to do so
but doesn't actually follow through. This patch fixes this for row
tombstones, to mirror the behaviour of the mutation compactor. This is
especially important for tests that compare mutations compacted with
different methods.
Mostly mechanical transformation. The main difference is in the detached
compaction state, from which we now get the range tombstone change,
instead of the range tombstone list. The code around this is a bit
awkward, will become simpler when compactor drops v1 support.
Reader recreation messes with the continuity of the mutation fragment
stream because it breaks snapshot isolation. We cannot guarantee that a
range tombstone or even the partition started before will continue after
too. So we have to make sure to wrap up all loose threads when
recreating the reader. We already close uncontinued partitions. This
commit also takes care of closing any range tombstone started by
unconditionally emitting a null range tombstone. This is legal to do,
even if no range tombstone was in effect.
We thought that unlike v1, v2 will not need this. But it does.
Handled similarly to how v1 did it: we ensure each buffer represents
forward progress, when the last fragment in the buffer is a range
tombstone change:
* Ensure the content of the buffer represents progress w.r.t.
_next_position_in_partition, thus ensuring the next time we recreate
the reader it will continue from a later position.
* Continue reading until the next (peeked) fragment has a strictly
larger position.
The code is just much nicer because it uses coroutines.
The evictable reader has a handful of flags dictating what to do after
the reader is recreated: what to validate, what to drop, etc. We
actually need a single flag telling us if the reader was recreated or
not, all other things can be derived from existing fields.
This patch does exactly that. Furthermore it folds do_fill_buffer() into
fill_buffer() and replaces the awkward to use `should_drop_fragment()`
with `examine_first_fragments()`, which does a much better job of
encapsulating all validation and fragment dropping logic.
This code reorganization also fixes two bugs introduced by the v2
conversion:
* The loop in `do_fill_buffer()` could become infinite in certain
circumstances due to a difference between the v1 and v2 versions of
`is_end_of_stream()`.
* The position of the first non-dropped fragment was not validated
(this was integrated into the range tombstone trimming which was
thrown out by the conversion).
The underlying mutation format is still v1, so consume() ends up doing
an online conversion. This allows converting all downstream code to v2,
leaving the conversion close to the code that is yet to be migrated to
v2 native: the mutation itself.
Currently all concepts are in mutation_fragment_v2.hh and
flat_mutation_reader_v2.hh. Organize concepts similar to how the v1 ones
are: move high-level consume concepts into
mutation_consumer_concepts.hh.
Before we add a v2 output option to the compactor, we want to get rid of
all the v1 inputs to make it simpler. This means that for a while the
compacting reader will be in a strange place of having a v2 input and a
v1 output. Hopefully, not for long.
The v2 spec allows for non-strictly monotonically increasing positions,
but has_monotonic_positions() tried to enforce it. Relax the check so it
conforms to the spec.
directory_lister provides a simpler interface compared to lister.
After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.
This is especially suitable for coroutines
as demonstrated in the unit tests that were added.
For example:
```c++
auto dl = directory_lister(path);
while (auto de = co_await dl.get()) {
co_await process(*de);
}
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #9835
* github.com:scylladb/scylla:
sstable_directory: process_sstable_dir: use directory_lister
sstable_directory: process_sstable_dir: fixup indentation
sstable_directory: coroutinize process_sstable_dir
lister: add directory_lister
The regression test we have for Alternator's issue #9487 (where a reverse
query without a Limit given was broken into 100MB pages instead of the
expected 1MB) is test_query.py::test_query_reverse_long. But this is a
very long test requiring a 100MB partition, and because of its slowness
isn't run by default.
This patch adds another version of that test, test_query_reverse_longish,
which reproduces the same issue #9487 with a partition 50 times shorter
(2MB) so it only takes a fraction of a second and can be enabled by
default. It also requires much less network traffic which is important
when running these tests non-locally.
We leave the original test test_query_reverse_long behind, it can be
still useful to stress Scylla even beyond the 100MB boundary, but it
remains in @veryslow mode so won't run in default test runs.
Refs #9487
Refs #7586
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220220161905.852994-1-nyh@scylladb.com>
A frozen schema can be quite large (in #10071 we measured 500 bytes per
column, and there can be thousands of columns in extreme tables). This
can cause large contiguous allocations and therefore memory stalls or
even failures to allocate.
Switch to bytes_ostream as the internal representation. Fortunately
frozen_schema is internally implemented as bytes_ostream, so the
change is minimal.
Ref #10071.
Test: unit (dev)
Closes #10105
this patch adds two gauges:
scylla_gossip_live - how many live nodes the gossiper sees
scylla_gossip_unreachable - how many nodes the gossiper tries to connect
to but cannot.
Both metrics are reported once per node (not per shard), which
gives visibility into how a specific node sees the cluster.
For example, in a split-brain 6-node cluster (3 and 3), each node would
report that it sees 2 nodes, but the monitoring system would see that
there are, in fact, 6 nodes.
Example of a two-node cluster, both nodes running:
``
scylla_gossip_live{shard="0"} 1.000000
scylla_gossip_unreachable{shard="0"} 0.000000
``
Example of a two-node cluster, one node down:
``
scylla_gossip_live{shard="0"} 0.000000
scylla_gossip_unreachable{shard="0"} 1.000000
``
Fixes #10102
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #10103
[avi: remove whitespace change and correct spelling]
The _file_name and _index_file fields in index_consume_entry_context
are no longer used anywhere in the class (_file_name isn't even set,
and _index_file was previously used when creating a promoted_index,
which doesn't store the file object anymore)
The JSON standard specifies numbers without making a distinction of what
is "an integer" and what is "floating point". The value 1e6 is a valid
number, and although it is customary in C that 1e6 is a floating-point
constant, as a JSON constant there is nothing inherently "non-integer" about
it - it is a whole number. This is why I believe CQL commands such as
CREATE TABLE t(pk int PRIMARY KEY, v int);
INSERT INTO t JSON '{"pk": 1, "v": 1e6}';
should be allowed, as 1e6 is a whole number and fits in the range of
Scylla's int.
The included tests show that, unfortunately, 1e6 is *not* currently
allowed to be assigned to an integer. The tests currently fail on both
Scylla and Cassandra - and we believe this failure to be a bug in both,
so the test is marked with xfail (known to fail) and cassandra-bug
(known failure on Cassandra considered to be a bug).
Refs #10100
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220220141602.843783-1-nyh@scylladb.com>
There is a bug in `expr::visit`. When trying to return a reference from a visitor it actually returns a reference to some temporary location.
So trying to do something like:
```c++
const expression e = new_bind_variable(123);
const bind_variable& ref = visit(overloaded_functor {
[](const bind_variable& bv) -> const bind_variable& { return bv; },
[](const auto&) -> const bind_variable& { throw std::runtime_error("Unreachable"); }
}, e);
std::cout << ref << std::endl;
```
Would actually print a random stack location instead of the value inside of `e`.
Additionally trying to return a non-const reference doesn't compile.
Current implementation of `expr::visit` is:
```c++
auto visit(invocable_on_expression auto&& visitor, const expression& e) {
return std::visit(visitor, e._v->v);
}
```
For reference, `std::visit` looks like this:
```c++
template<typename _Res, typename _Visitor, typename... _Variants>
constexpr _Res
visit(_Visitor&& __visitor, _Variants&&... __variants)
{
return std::__do_visit<_Res>(std::forward<_Visitor>(__visitor),
std::forward<_Variants>(__variants)...);
}
```
The problem is that `auto` can evaluate to `int` or `float`, but not to `int&`.
It has now been changed to `decltype(auto)`, which is able to express references.
I also added a missing `std::forward` on the visitor argument.
The new version looks like this:
```c++
template <invocable_on_expression Visitor>
decltype(auto) visit(Visitor&& visitor, const expression& e) {
return std::visit(std::forward<Visitor>(visitor), e._v->v);
}
```
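The difference between the two return type placeholders can be reproduced outside of Scylla with a plain std::variant. In the sketch below `holder`, `visit_auto`, `visit_decltype_auto` and `int_ref_visitor` are all hypothetical names standing in for `expression` and `expr::visit`; only the deduction behaviour is the point.

```c++
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical stand-in for expression: only the variant matters here.
using value = std::variant<int, std::string>;
struct holder { value v; };

// `auto` strips references: a reference returned by the visitor is copied.
template <typename Visitor>
auto visit_auto(Visitor&& visitor, const holder& h) {
    return std::visit(std::forward<Visitor>(visitor), h.v);
}

// `decltype(auto)` keeps the exact type of the std::visit call,
// references included.
template <typename Visitor>
decltype(auto) visit_decltype_auto(Visitor&& visitor, const holder& h) {
    return std::visit(std::forward<Visitor>(visitor), h.v);
}

// A visitor that returns a reference for every alternative.
struct int_ref_visitor {
    const int& operator()(const int& i) const { return i; }
    const int& operator()(const std::string&) const {
        static const int fallback = 0;
        return fallback;
    }
};

static_assert(std::is_same_v<
    decltype(visit_auto(int_ref_visitor{}, std::declval<const holder&>())),
    int>);
static_assert(std::is_same_v<
    decltype(visit_decltype_auto(int_ref_visitor{}, std::declval<const holder&>())),
    const int&>);

// True if the reference obtained through visit_decltype_auto aliases the
// int stored inside h.
bool aliases_stored_int(const holder& h) {
    const int& ref = visit_decltype_auto(int_ref_visitor{}, h);
    return &ref == std::get_if<int>(&h.v);
}
```

The two static_asserts capture the fix: with `auto` the deduced return type is `int`, with `decltype(auto)` it is `const int&`.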
I added some tests of `expr::visit` in `boost/expr_test`, but sadly they are not as thorough as they could be. Ideally I could return a reference from `std::visit` and `expr::visit` and then check that they both point to the same address in memory.
I can't do this because it would require accessing a private field of `expression`.
Some tests pass before the fix, even though they shouldn't, but I'm not sure how to make them better without making a field of `expression` public.
I played around with some code, it can be found here: https://github.com/cvybhu/attached-files/blob/main/visit/visit_playground.cpp
Closes #10073
* github.com:scylladb/scylla:
cql3: expr: Add a test to show that std::forward is needed in expr::visit
cql3: expr: add std::forward in expr::visit
cql3: expr: Add tests for expr::visit
cql3: expr: Fix expr::visit so that it works with references
Adds a test with a visitor that can only be used as an rvalue.
Without std::forward in expr::visit this test doesn't compile.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
expr::visit was missing std::forward on the visitor.
In cases where the visitor was passed as an rvalue it wouldn't
be properly forwarded to std::visit.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add tests for new expr::visit to ensure that it is working correctly.
expr::visit had a hidden bug where trying to return a reference
actually returned a reference to freed location on the stack,
so now there are tests to ensure that everything works.
Sadly the test `expr_visit_const_ref` also passes
before the fix, but at least expr_visit_ref doesn't compile before the fix.
It would be better to test this by taking references returned
by std::visit and expr::visit and checking that they point
to the same address in memory, but I can't do this
because I would have to access a private field of expression.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Check that SELECT {columns} FROM system.clients returns back only local
connection of cql type (because there are no others during the test).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the connection_notifier is all gone, only the client_data bits are left.
To keep it consistent -- rename the files.
Also, while at it, brush up the header dependencies and remove the not
really used constexprs for client states.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This includes most of the connection_notifier stuff as well as
the auxiliary code from system_keyspace.cc and a bunch of
updating calls from the client state changing.
Other than less code and fewer disk updates on client connection
paths, this removes one usage of the nasty global qctx thing.
Since system.clients goes away, rename system.clients_v
here too so the table is always present out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This table mirrors the existing clients one but temporarily
has its own name. The schema is the same as in system.clients.
The table gets client_data's from the registered protocol
servers, which in turn are obtained from the storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The call returns a chunked_vector with client_data's. For now
only the native transport implements it, others return empty
vector.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now when the client state changes the respective update is
performed on the system.clients table. While doing it some bits
from this state are lost from the in-memory structures. For the
sake of exporting this information we need to track whether the
connected client goes authenticating or is already ready.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two fields on the client_data that are not mapped to
string with the help of standard fmt library. Add two methods
that turn client state and type into strings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the ability to iterate over the list of connections in a "gentle"
manner, i.e. -- preempting the loop when required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a document that sums up the tables from system keyspace and
it's missing the clients table. This set is going to reimplement the
table keeping the schema intact, so it's a good time to document it
right at the beginning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The Alternator CreateTable operation currently performs several schema-
changing operations separately - one by one: It creates a keyspace,
a table in that keyspace and possibly also multiple views, and it sets
tags on the table. A consequence of this is that concurrent CreateTable
and DeleteTable operations (for example) can result in unexpected errors
or inconsistent states - for example CreateTable wants to create the
table in the keyspace it just created, but a concurrent DeleteTable
deleted it. We have two issues about this problem (#6391 and #9868)
and three tests (test_table.py::test_concurrent_create_and_delete_table)
reproducing it.
In this patch we fix these problems by switching to the modern Scylla
schema-changing API: Instead of doing several schema-changing
operations one by one, we create a vector of schema mutation performing
all these operations - and then perform all these mutations together.
When the experimental Raft-based schema modifications are enabled, this
completely solves the races, and the tests begin to pass. However, if
the experimental Raft mode is not enabled, these tests continue to fail
because there is still no locking while applying the different schema
mutations (not even on a single node). So I put a special fixture
"fails_without_raft" on these tests - which means that the tests
xfail if run without Raft and are expected to pass when run with Raft.
Indeed, after this patch
test/alternator/run --raft test_table.py::test_concurrent_create_and_delete_table
shows three passing tests (they also pass if we drastically increase the
number of iterations), while
test/alternator/run test_table.py::test_concurrent_create_and_delete_table
shows three xfailing tests.
All other Alternator tests pass as before with this patch, verifying
that the handling of new tables, new views, tags, and CDC log tables,
all happen correctly even after this patch.
A note about the implementation: Before this patch, the CreateTable code
used high-level functions like prepare_new_column_family_announcement().
These high-level functions become unusable if we write multiple schema
operations to one list of mutations, because for example this function
validates that the keyspace had already been created - when it hasn't
and that's the whole point. So instead we had to use lower-level
functions like add_table_or_view_to_schema_mutation() and
before_create_column_family(). However, despite being lower level,
these functions were public so I think it's reasonable to use them,
and we probably have no other alternative.
Fixes #6391
Fixes #9868
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Replacing the previous text output with the exception of the dump-data
command. The text output was supposed to be human-friendly but it is not
really human friendlier than a well formatted JSON, the latter having
the additional advantage of being machine friendly too. Although the
text output already exists, having just one output format makes the code
much simpler and easier to maintain so we chose not to pay the higher
maintenance price for a format that is not expected to see much (if any)
use.
Although the JSON written by the tool is not formatted, it can easily be
formatted by e.g. piping it through `jq`. The latter also allows lookup
of specific field(s).
The JSON schema of each command is documented in the --help output of
the respective command (e.g. scylla sstable data-dump --help).
We keep the text output of the dump-data command as this is using
scylla's built-in printer that we also use in logging and tests. Some
people might be used to this format, so leave it in: the code already
exists for it and lives in scylla core, so we don't need to maintain it
separately. The default output-format of dump-data is now JSON.
A smoke test suite is added for the dump commands too. The tests only
check that some output is present and that it is valid JSON.
Refs: #9882
Tests: unit(dev)
Also on: https://github.com/denesb/scylla.git scylla-sstable-json/v2
Changelog
v3:
* Rebase on recent master (which has the required seastar fixes for
debug tests)
v2:
* Document the JSON schema of each command.
* Use the SAX-style API of rapidjson to generate streaming JSON, instead
of hand-generating it.
Closes #10074
* github.com:scylladb/scylla:
test/cql-pytest: add tests for scylla-sstable's dump commands
test/cql-pytest: prepare for tool tests
tools/schema_loader: auto-create the keyspace for all statements
tools/scylla-sstable: change output of dump-scylla-metadata to json
tools/scylla-sstable: change output of dump-statistics to json
tools/scylla-sstable: change output of dump-summary to json
tools/scylla-sstable: change output of dump-compression-info to json
tools/scylla-sstable: change output of dump-index to json
tools/scylla-sstable: add json support in --dump-data
tools/scylla-sstable: add json_writer
tools/scylla-sstable: use fmt::print in --dump-data
tools/scylla-sstable: prepare --dump-data for multiple output formats
expr::visit had a bug where if we wanted to return
a reference in the visitor, the reference would be
to a temporary stack location instead of the passed
argument.
So trying to do something like this:
```
const bind_variable& ref = visit(overloaded_functor {
    [](const bind_variable& bv) -> const bind_variable& { return bv; },
    [](const auto&) -> const bind_variable& { ... }
}, e);
std::cout << ref << std::endl;
```
would actually print a random location on the stack instead
of the valid value inside of e.
Additionally trying to return a non-const reference
doesn't even compile.
The problem was that the return type of expr::visit
was defined as `auto`, which can be `int`, but not `int&`.
This has been changed to `decltype(auto)`, which can be both `int` and `int&`.
New version of `expr::visit` works for `const expression&` and `expression&`
no matter what the visitor returns.
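The difference between the two deduction rules can be seen in a minimal sketch. The names below (`visit_by_auto`, `visit_by_decltype_auto`, the one-field `bind_variable`) are hypothetical stand-ins for illustration, not the real `expr::visit` or expression types:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the real expression alternative.
struct bind_variable { int index; };

// `auto` deduction drops references and top-level const: the caller gets a copy.
template <typename Visitor>
auto visit_by_auto(Visitor&& v, const bind_variable& bv) {
    return std::forward<Visitor>(v)(bv);
}

// `decltype(auto)` keeps the visitor's exact return type, including references.
template <typename Visitor>
decltype(auto) visit_by_decltype_auto(Visitor&& v, const bind_variable& bv) {
    return std::forward<Visitor>(v)(bv);
}

// A visitor that returns a reference to its argument.
inline auto identity_visitor() {
    return [](const bind_variable& bv) -> const bind_variable& { return bv; };
}

// True if the reference obtained through decltype(auto) aliases the argument.
inline bool returns_same_object(const bind_variable& bv) {
    const bind_variable& ref = visit_by_decltype_auto(identity_visitor(), bv);
    return &ref == &bv;
}

static_assert(std::is_same_v<
    decltype(visit_by_auto(identity_visitor(), std::declval<const bind_variable&>())),
    bind_variable>);
static_assert(std::is_same_v<
    decltype(visit_by_decltype_auto(identity_visitor(), std::declval<const bind_variable&>())),
    const bind_variable&>);
```

With `auto`, binding the result to a `const bind_variable&` binds to a copy rather than to the original object, which is the class of bug described above.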
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The tests are smoke-tests: they mostly check that scylla doesn't crash
while dumping and it produces *some* output. When dumping json, the test
checks that it is valid json.
We want to add tool tests. These tests will have to invoke scylla
executable (as tools are hosted by the latter) and they want access to
the scylla data directories. Propagate the scylla path and data
directory used from `run` into the test suite via pytest request
parameters.
Currently the keyspace is only auto-created for create type statements.
However the keyspace is needed even without UDTs being involved: for
example if the table contains a collection type. So auto-create the
keyspace unconditionally before preparing the first statement.
Also add a test-case with a create table statement which requires the
keyspace to be present at prepare time.
Wraps a rapidjson::Writer<> and mirrors the latter's API, providing
more convenient overloads for the Key() and String() methods, as well as
providing some extra, scylla-sstable specific methods too.
Extract the actual dumping code into a separate class, which also
implements sstable_consumer interface. The dumping consumer now just
forwards calls to actual dumper through the abstract consumer interface,
allowing different concrete dumpers to be instantiated.
This method is not overridden by any of the derived classes, so it does
not need to be virtual.
(cherry picked from commit b7fb93dc46531bca8db535301a069df52991f9d9)
When you interrupt a process in gdb using Ctrl-C or attach gdb to a
running process, usually gdb will show the current frame as
`syscall()` (no source information). But in some less usual setups
gdb may happen to know that `syscall()` is implemented in assembly,
and even knows which line is current in which assembly file.
An unfortunate effect of gdb knowing that the current frame's source
language is assembly is that since assembly is not C++, gdb's
expression parser switches to "auto" while in the `syscall()` stack
frame. And in the "auto" language explicit C++ global namespace
references like "::debug::the_database" are not syntactically valid,
which renders much of scylla-gdb.py unusable unless you remember to go
up the call stack before doing anything.
But since scylla-gdb.py is there to help debug Scylla, and Scylla is
written in C++, we can just set gdb source language to "c++" and avoid
the problem.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220216235301.1206341-1-michael.livshin@scylladb.com>
"
This series implements support for the ME sstable format (introduced
in C* 3.11.11).
Tests: unit(dev)
"
* tag 'me-sstable-format-v5' of https://github.com/cmm/scylla:
sstables: validate originating host id
sstable: add is_uploaded() predicate
config: make the ME sstable format default
scylla-gdb.py: recognize ME sstables
sstables: store originating host id in stats metadata
system_keyspace: cache local host id before flushing
database_test: ensure host id continuity
sstables_manager: add get_local_host_id() method and support
sstables_manager: formalize inheritability
system_keyspace, main: load (or create) local host id earlier
sstable_3_x_test: test ME sstable format too
add "ME_SSTABLE" cluster feature
add "sstable_format" config
add support for the ME sstable format
scylla-sstable: add ability to dump optionals and utils::UUID
sstables: add ability to write and parse optionals
globalize sstables::write(..., utils::UUID)
Add an additional sstable validation step to check that originating
host id matches the local host id.
This is only done for ME-and-up sstables, which do not come from
upload/, and when the local host id is known.
When local host id is unknown, check that the sstable belongs to a
system keyspace, i.e. whether it is plausible that Scylla is still
booting up and hasn't loaded/generated the local host id yet.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
With this change, ME sstables start carrying their originating host
id, which makes ME format feature-complete so it can be made default.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Later in this series the ME sstable format is made default, which
means that `system.local` will likely be written as ME.
Since, in ME, originating host id is a part of sstable stats metadata,
the local host id needs to either already be cached by the time
`system.local` is flushed, or to somehow be special-case-ignored when
flushing `system.local`.
The former (done here) is optimistic (cache before flush), but the
alternative would be an abstraction violation and would also cost a
little time upon each sstable write.
(Cache-before-flush could be undone by catching any exceptions during
flush and un-caching, but inability to `co_await` in catch clauses
makes the code look rather awkward. And there is no need to bother
because bootstrap failures should be fatal anyway)
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The "populate_from_quarantine_works" test case creates sstables with
one db config, then reads them with another. Ensure that both configs
have the same host id so the sstables pass validation.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Since ME sstable format includes originating host id in stats
metadata, local host id needs to be made available for writing and
validation.
Both Scylla server (where local host id comes from the `system.local`
table) and unit tests (where it is fabricated) must be accommodated.
Regardless of how the host id is obtained, it is stored in the db
config instance and accessed through `sstables_manager`.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
The class is already inherited from in tests (along with overriding a
non-virtual method), so this seems to be called for.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
We want it to be cached before any sstable is written, so do it right
after system_keyspace::minimal_setup().
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.
Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
(that is, instances of `std::optional`).
The ME sstable format includes optional originating host id in stats
metadata. We know how to write and parse uuids, but not how to write
and parse optionals.
The format is (used by C* in this case, and also happens to be
consistent with how booleans are serialized): first a boolean
indicating whether the contents are present (0 or 1, as a byte), then
the contents (if any).
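The layout described above can be sketched as follows. This is only an illustration: a `std::uint64_t` payload stands in for the host id UUID, big-endian byte order is an assumption of the sketch, and `write_optional`/`parse_optional` are hypothetical names, not the sstables code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Append a 64-bit value in big-endian byte order (an assumption of this sketch).
void write_u64(bytes& out, std::uint64_t v) {
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(v >> shift));
    }
}

// Presence byte first (0 or 1), then the contents if the optional is engaged.
void write_optional(bytes& out, const std::optional<std::uint64_t>& v) {
    out.push_back(v.has_value() ? std::uint8_t{1} : std::uint8_t{0});
    if (v) {
        write_u64(out, *v);
    }
}

// Parse one optional starting at `pos` and advance `pos` past it.
// `at()` throws std::out_of_range if the input is truncated.
std::optional<std::uint64_t> parse_optional(const bytes& in, std::size_t& pos) {
    if (in.at(pos++) == 0) {
        return std::nullopt;
    }
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        v = (v << 8) | in.at(pos++);
    }
    return v;
}
```

A disengaged optional therefore costs a single byte, and an engaged one costs one byte more than its contents.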
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Throwing std::runtime_error results in
http status 500 (internal_server_error), but the problem
is with the request parameters, not with the server.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit changes the behavior of `exception_container::accept`. Now,
instead of throwing an `utils::bad_exception_container_access` exception
when the container is empty, the provided visitor is invoked with that
exception instead. There are two reasons for this change:
- The exception_container is supposed to allow handling exceptions
without using the costly C++'s exception runtime. Although an empty
  container is an edge case, I think the new behavior is more aligned
with the class' purpose. The old behavior can be simulated by
providing a visitor which throws when called with bad access
exception.
- The new behavior fixes a bug in `result_try`/`result_futurize_try`.
Before the change, if the `try` block returned a failed result with an
empty exception container, a bad access exception would either be
thrown or returned as an exceptional future without being handled by
the `catch` clauses. Although nobody is supposed to return such
result<>s on purpose, a moved out result can be returned by accident
and it's important for the exception handling logic to be correct in
such a situation.
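The new `accept` behavior can be modeled with a small sketch. `tiny_exception_container`, `bad_container_access`, the two error types and the `describe` visitor are hypothetical and only imitate the idea; they are not the real `utils::exception_container`:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical counterpart of the "bad access" exception.
struct bad_container_access : std::logic_error {
    bad_container_access() : std::logic_error("exception container is empty") {}
};
struct timeout_error {};
struct overload_error {};

class tiny_exception_container {
    // std::monostate models the empty (e.g. moved-out) container.
    std::variant<std::monostate, timeout_error, overload_error> _ex;
public:
    tiny_exception_container() = default;
    explicit tiny_exception_container(timeout_error e) : _ex(e) {}
    explicit tiny_exception_container(overload_error e) : _ex(e) {}

    // An empty container does not throw: the visitor is invoked with
    // a bad_container_access object instead.
    template <typename Visitor>
    std::string accept(Visitor&& v) const {
        return std::visit([&v](const auto& ex) -> std::string {
            using T = std::decay_t<decltype(ex)>;
            if constexpr (std::is_same_v<T, std::monostate>) {
                return v(bad_container_access());
            } else {
                return v(ex);
            }
        }, _ex);
    }
};

// Example visitor: handles every stored error type plus the empty case.
struct describe {
    std::string operator()(const timeout_error&) const { return "timeout"; }
    std::string operator()(const overload_error&) const { return "overloaded"; }
    std::string operator()(const bad_container_access&) const { return "empty container"; }
};
```

A visitor that wants the old behavior can simply throw from its `bad_container_access` overload.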
Tests: unit(dev)
Closes #10086
CDC registers to the table-creation hook (before_create_column_family)
to add a second table - the CDC log table - to the same keyspace.
The handler function (on_before_update_column_family() in cdc/log.cc)
wants to retrieve the keyspace's definition, but that does NOT WORK if
we create the keyspace and table in one operation (which is exactly what
we intend to do in Alternator to solve issue #9868) - because at the
time of the hook, the keyspace does not yet exist in the schema.
It turns out that on_before_update_column_family() does not REALLY need
the keyspace. It needed it to pass it on to make_create_table_mutations()
but that function doesn't use the keyspace parameter passed to it! All
it needs is the keyspace's name - which is in the schema anyway and
doesn't need to be looked up.
So in this patch we fix make_create_table_mutations() to not require the
unused keyspace parameter - and fix the CDC code not to look for the
keyspace that is no longer needed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220215162342.622509-1-nyh@scylladb.com>
Prerequisite for the "ME sstable format support" series (which has been
posted to the mailing list) -- to be merged or rejected together with
that.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #9939
In commit a664ac7ba5, the Alternator
schema-modifying code (e.g., delete_table()) was reorganized to support
the new Raft-based schema modifications. Schema modifications now work
with an "optimistic locking" approach: We retrieve the current schema
version id ("group0_guard"), read the current schema and verify that we
can make the changes we want, and then apply them with
mm.announce(group0_guard) - which will fail if the schema version is not
current because some other concurrent modification beat us in the race.
This means that we need to do this whole read-modify-write (group0_guard,
checking the schema, creating mutations, calling mm.announce()) in a
*retry loop*. We have such a loop in the CQL code but it's missing in the
Alternator code. In this patch we don't add the loop yet, but add FIXMEs
to remind us where it's missing.
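The shape of such a retry loop can be sketched on a toy single-process model. Everything here (`schema_store`, `announce`, `delete_table_with_retry`, the `interfere` hook) is hypothetical and only mirrors the read-modify-write pattern; it is not the migration manager API:

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Toy model of a versioned schema with optimistic locking.
struct schema_store {
    int version = 0;
    std::set<std::string> tables;

    // Apply `change` only if the caller's guard still matches the current version.
    bool announce(int guard, const std::function<void(std::set<std::string>&)>& change) {
        if (guard != version) {
            return false;  // some other modification won the race
        }
        change(tables);
        ++version;
        return true;
    }
};

enum class outcome { done, not_found, gave_up };

// Read-modify-write in a retry loop. `interfere` is a test hook that simulates
// a concurrent modification between taking the guard and announcing.
outcome delete_table_with_retry(schema_store& store, const std::string& name,
                                int max_attempts,
                                const std::function<void(int)>& interfere = {}) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const int guard = store.version;        // 1. take the guard
        if (store.tables.count(name) == 0) {    // 2. validate against the schema just read
            return outcome::not_found;
        }
        if (interfere) {
            interfere(attempt);
        }
        const bool applied = store.announce(guard, [&](std::set<std::string>& t) {
            t.erase(name);                      // 3. apply the prepared change
        });
        if (applied) {
            return outcome::done;
        }
        // Lost the race: start over with a fresh guard and a fresh schema read.
    }
    return outcome::gave_up;
}
```

The validation step is repeated on every attempt because the schema may have changed in a way that makes the original decision invalid.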
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220214154435.544125-1-nyh@scylladb.com>
We add reproducing tests for two known Alternator issues, #6391 and #9868,
which involve the non-atomicity of table creation. Creating a table
currently involves multiple steps - creating a keyspace, a table,
materialized views, and tags. If some of these steps succeed and some
fail, we get an InternalServerError and potentially leave behind some
half-built table.
Both issues will be solved by making better use of the new Raft-based
capabilities of making multiple modifications to the schema atomically,
but this patch doesn't fix the problem - it just proves it exists.
The new tests involve two threads - one repeatedly trying to create a
table with a GSI or with tags - and the other thread repeatedly trying
to delete the same table under its feet. Both bugs are reproduced almost
immediately.
Note that like all test/alternator tests, the new tests are usually run on
just one node. So when we fix the bug and these tests begin to pass,
it will not be a proof that concurrent schema modification works safely
on *different* nodes. To prove that, we will also need a multi-node test.
However, this test can prove that we used Raft-based schema modification
correctly - and if we assume that the Raft-based schema modification
feature is itself correct, then we can be sure that CreateTable will be
correct also across multiple nodes. Although it won't hurt to check it
directly.
Refs #6391
Refs #9868
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207223100.207074-1-nyh@scylladb.com>
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).
The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.
This table will be used to persist the list of peers discovered by the
`discovery` algorithm that is used for creating Raft group 0 when
bootstrapping a fresh cluster.
The name `get_output` suggests that this is the only way to get output
from `discovery`. But there is a second public API: `request`, which also
provides us with a different kind of output.
Rename it to `tick`, which describes what the API is used for:
periodically ticking the discovery state machine in order to make
progress.
In `raft_group0::discover_group0`, when we detect that we became a
leader, we destroy the `discovery` object, create a group 0, and respond
with the group 0 information to all further requests.
However there is a small time window after becoming a leader but before
destroying the `discovery` object where we still answer to discovery
requests by returning peer lists, without informing the requester that
we became a leader.
This is unsafe, and the algorithm specification does not allow this. For
example, consider the seed graph 0 --> 1. It satisfies the property
required by the algorithm, i.e. that there exists a vertex reachable
from every other vertex. Now `1` can become a leader before `0` contacts it.
When `0` contacts `1`, it should learn from `1` that `1` created a group 0, so
`0` does not become a leader itself and create another group 0. However,
with the current implementation, it may happen that `0` contacts `1` and
receives a peer list (instead of group 0 information), and also becomes
a leader because it has the smallest ID, so we end up with two group 0s.
The correct thing to do is to stop returning peer lists to requests
immediately after becoming a leader. This is what we fix in this commit.
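The rule can be sketched as a responder whose answer type depends on its leadership state. The class and type names below are hypothetical and much simpler than the real `discovery` and `raft_group0` code:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <variant>

using peer_id = int;
struct peer_list { std::set<peer_id> peers; };
struct group0_info { peer_id leader; };

// Hypothetical model: once this node became a leader, every request is
// answered with group 0 information, never with a peer list.
class discovery_responder {
    peer_id _self;
    std::set<peer_id> _peers;
    bool _is_leader = false;
public:
    discovery_responder(peer_id self, std::set<peer_id> seeds)
        : _self(self), _peers(std::move(seeds)) {}

    void become_leader() { _is_leader = true; }

    std::variant<peer_list, group0_info> request(peer_id from) {
        if (_is_leader) {
            return group0_info{_self};
        }
        _peers.insert(from);
        return peer_list{_peers};
    }
};
```

In the seed graph 0 --> 1 example, node `0` then learns about the existing group 0 from its first contact with `1` instead of receiving a peer list.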
This reverts commit 4c05e5f966.
Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.
Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.
Fixes #10060.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>
This reverts commit 23da2b5879. It causes
the node to quickly run out of memory when many schema changes are made
within a small time window.
Fixes #10071.
Adds `utils::result_try` and `utils::result_futurize_try` - functions which allow converting existing try..catch blocks into a version which handles C++ exceptions, failed results with exception containers and, depending on the function variant, exceptional futures using the same exception handling logic.
For example, you can convert the following try..catch block:
    try {
        return a_function_that_may_throw();
    } catch (const my_exception& ex) {
        return 123;
    } catch (...) {
        throw;
    }
...to this:
    return utils::result_try([&] {
        return a_function_that_may_throw_or_return_a_failed_result();
    }, utils::result_catch<my_exception>([&] (const my_exception&) {
        return 123;
    }), utils::result_catch_dots([&] (auto&& handle) {
        return handle.into_result();
    }));
Similarly, `utils::result_futurize_try` can be used to migrate `then_wrapped` or `f.handle_exception()` constructs.
As an example of the usability of the new constructs, two places in the current code which need to simultaneously handle exceptions and failed results are converted to use `result_try` and `result_futurize_try`.
Results of `perf_simple_query --smp 1 --operations-per-shard 1000000 --write`:
```
127041.61 tps ( 67.2 allocs/op, 14.2 tasks/op, 52422 insns/op)
126958.60 tps ( 67.2 allocs/op, 14.2 tasks/op, 52409 insns/op)
127088.37 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op)
127560.84 tps ( 67.2 allocs/op, 14.2 tasks/op, 52424 insns/op)
127826.61 tps ( 67.2 allocs/op, 14.2 tasks/op, 52406 insns/op)
126801.02 tps ( 67.2 allocs/op, 14.2 tasks/op, 52420 insns/op)
125371.51 tps ( 67.2 allocs/op, 14.2 tasks/op, 52425 insns/op)
126498.51 tps ( 67.2 allocs/op, 14.2 tasks/op, 52427 insns/op)
126359.41 tps ( 67.2 allocs/op, 14.2 tasks/op, 52423 insns/op)
126298.27 tps ( 67.2 allocs/op, 14.2 tasks/op, 52410 insns/op)
```
The number of tasks and allocations is unchanged. The number of instructions per operation seems similar; it may have increased slightly (by 10-20), but it's hard to tell for sure because of the noisiness of the results.
Tests: unit(dev)
Closes #10045
* github.com:scylladb/scylla:
transport: use result_try in process_request_one
storage_proxy: use result_futurize_try in mutate_end
storage_proxy: temporarily throw exception from result in mutate_end
utils: add result_try and result_futurize_try
After the mechanical change in fcb8d040e8
("treewide: use Software Package Data Exchange (SPDX) license identifiers"),
a few stray license blurbs or fragments thereof remain. In two cases
these were extra blurbs in code generators intended for the generated code,
in others they were just missed by the script.
Clean them up, adding an SPDX license identifier where needed.
Closes #10072
This PR rewrites the `utils::result_parallel_for_each`'s implementation to resemble the original `seastar::parallel_for_each` more closely instead of using the less efficient `seastar::map_reduce`. It uses fewer tasks and allocations now, as demonstrated in the results from the `perf_simple_query` benchmark, attached at the end of the cover letter.
The main drawback of the new implementation is that it needs to rethrow exceptions propagated as exceptional futures from the parallel sub-invocations. Contrary to the original `seastar::parallel_for_each` which uses a custom task to collect results, the new `utils::result_parallel_for_each` uses a coroutine and there doesn't currently seem to be a way to co_await for a future and inspect its state without either rethrowing or handling it in then_wrapped (which allocates a continuation). Fortunately, rethrowing is not needed for exceptions returned in failed result<>, which are already intended to be a more performant alternative to regular exceptions.
As a bonus, definitions from `utils/result.hh` are now split across three different headers in order to improve (re)compilation times.
Results from `perf_simple_query --smp 1 --operations-per-shard 1000000 --write` (before vs. after):
```
126872.54 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op)
126532.13 tps ( 67.2 allocs/op, 14.2 tasks/op, 52408 insns/op)
126864.99 tps ( 67.2 allocs/op, 14.2 tasks/op, 52428 insns/op)
127073.10 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op)
126895.85 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op)
127894.02 tps ( 66.2 allocs/op, 13.2 tasks/op, 52036 insns/op)
127671.51 tps ( 66.2 allocs/op, 13.2 tasks/op, 52042 insns/op)
127541.42 tps ( 66.2 allocs/op, 13.2 tasks/op, 52044 insns/op)
127409.10 tps ( 66.2 allocs/op, 13.2 tasks/op, 52052 insns/op)
127831.30 tps ( 66.2 allocs/op, 13.2 tasks/op, 52043 insns/op)
```
Test: unit(dev, debug)
Closes #10053
* github.com:scylladb/scylla:
utils/result: optimize result_parallel_for_each
utils/result: split into `combinators` and `loop` file
This series continues the effort of https://github.com/scylladb/scylla/pull/9844 to reduce `seastar::async` usage and coroutinize in the gossiper code.
There are mostly trivial conversions from using `.get()` to `co_await`, where appropriate, as well as elimination of `seastar::async()` wrappers.
A few more functions are not yet converted, though (e.g. `apply_new_states`, `do_apply_state_locally`, `apply_state_locally`, `apply_state_locally_without_listener_notification`, maybe a few others, as well).
The motivation is to be able to call every public API function of `gossiper` class without requiring `seastar::async` context.
Tests: unit(debug, dev), dtest (topology-related tests)
Closes #10032
* github.com:scylladb/scylla:
gms: gossiper: coroutinize `wait_for_gossip`
gms: gossiper: coroutinize `advertise_token_removed`
gms: gossiper: coroutinize `advertise_removing`
gms: gossiper: don't wrap `convict` calls into `seastar::async`
gms: gossiper: coroutinize `handle_major_state_change`
gms: gossiper: coroutinize `handle_shutdown_msg`
gms: gossiper: coroutinize `mark_as_shutdown` and `convict`
gms: gossiper: remove comment about requiring thread context in `mark_alive`
gms: gossiper: don't use `seastar::async` in `mark_alive`
gms: gossiper: coroutinize `do_on_change_notifications`
gms: gossiper: coroutinize `do_before_change_notifications`
gms: gossiper: coroutinize `real_mark_alive`
gms: gossiper: coroutinize `mark_dead`
It now resembles the original parallel_for_each more, but uses a
coroutine instead of a custom `task` to collect not-ready futures.
Although the usage of a coroutine saves on allocations, the drawback is
that there is currently no way to co_await on a future and handle its
exception without throwing or without unconditionally allocating a
then_wrapped or handle_exception continuation - so it introduces a
rethrow.
Furthermore, now failed results and exceptions are treated as equals.
Previously, in case one parallel invocation returned failed result and
another returned an exception, the exception would always be returned.
Now, the failed result/exception of the invocation with the lowest index
is always preferred, regardless of the failure type.
The reimplementation manages to save about 350-400 instructions, one
task and one allocation in the perf_simple_query benchmark in write
mode.
Results from `perf_simple_query --smp 1 --operations-per-shard 1000000
--write` (before vs. after):
```
126872.54 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op)
126532.13 tps ( 67.2 allocs/op, 14.2 tasks/op, 52408 insns/op)
126864.99 tps ( 67.2 allocs/op, 14.2 tasks/op, 52428 insns/op)
127073.10 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op)
126895.85 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op)
127894.02 tps ( 66.2 allocs/op, 13.2 tasks/op, 52036 insns/op)
127671.51 tps ( 66.2 allocs/op, 13.2 tasks/op, 52042 insns/op)
127541.42 tps ( 66.2 allocs/op, 13.2 tasks/op, 52044 insns/op)
127409.10 tps ( 66.2 allocs/op, 13.2 tasks/op, 52052 insns/op)
127831.30 tps ( 66.2 allocs/op, 13.2 tasks/op, 52043 insns/op)
```
Test: unit(dev), unit(result_utils_test, debug)
Segregates result utilities into:
- result.hh - basic definitions related to results with exception
containers,
- result_combinators.hh - combinators for working with results in
conjunction with futures,
- result_loop.hh - loop-like combinators, currently has only
result_parallel_for_each.
The motivation for the split is:
1. In headers, usually only result.hh will be needed, so no need to
force most .cc files to compile definitions from other files,
2. Fewer files need to be recompiled when a combinator is added to
result_combinators or result_loop.
As a bonus, `result_with_exception` was moved from `utils::internal` to
just `utils`.
Adapts the exception handling logic in process_request_one so that it
uses utils::result_try to handle both C++ exceptions and failed results
in a unified way.
Adapts the mutate_end exception handling logic so that it uses the new
utils::result_futurize_try function to handle both exceptional futures
and failed results in a unified way.
Temporarily removes the logic which handles failed results in a
non-throwing way. Exceptions from failed results are thrown and handled
in try..catch.
The reason for this change is that it makes the following commit, which
migrates the whole try..catch block to utils::result_futurize_try much
nicer. The next commit will also bring back the non-throwing handling of
the failed result.
Adds result_try and result_futurize_try - functions which allow
converting existing try..catch blocks into a version which handles C++
exceptions, failed results with exception containers and, depending on
the function variant, exceptional futures.
Secondary tracing sessions used to compute the execution time
from the point of their `begin()`-ning, not the parent session's
`begin()`. As a result, replica reported a slow query if it
exceeded the entire threshold *on that replica* too.
This change augments `trace_info` with the TS of parent's session
starting point, to be used as a reference on replicas.
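The effect of the reference point can be illustrated with a small arithmetic sketch. The helper below is hypothetical, uses plain `std::chrono` durations as timestamps since an arbitrary epoch, and ignores clock differences between nodes:

```cpp
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// True if the time elapsed since `reference_start` exceeds the slow-query threshold.
constexpr bool exceeds_threshold(milliseconds reference_start, milliseconds now,
                                 milliseconds threshold) {
    return now - reference_start > threshold;
}

// Parent session began at t=0 ms, the replica's secondary session at t=80 ms,
// and the replica finishes at t=120 ms with a 100 ms threshold.
constexpr milliseconds parent_begin{0};
constexpr milliseconds replica_begin{80};
constexpr milliseconds replica_end{120};
constexpr milliseconds threshold{100};

// Measured from the replica's own begin(): only 40 ms, not reported as slow.
static_assert(!exceeds_threshold(replica_begin, replica_end, threshold));
// Measured from the parent's begin(): 120 ms, reported as slow.
static_assert(exceeds_threshold(parent_begin, replica_end, threshold));
```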
Fixes #9403
Closes #10005
`system.raft`, `system.raft_snapshots` and `system.raft_config`
were missing from the `extra_durable_tables` list, so that
`set_wait_for_sync_to_commitlog(true)` was not enabled when
the tables were re-created via `create_table_from_mutations`.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20220210073418.484843-1-pa.solodovnikov@scylladb.com>
main() has some logic to select the main function it will delegate to
based on argv[1]. The intent is that when the value of argv[1] suggests
that the user did not specify a specific app to run, we default to
"server" (scylla proper).
This logic currently breaks down when there are no arguments at all: in
this case the following error is printed and scylla refuses to start:
error: unrecognized first argument: expected it to be "server", a regular command-line argument or a valid tool name (see `scylla --list-tools`), but got
Fix this by checking for empty argv[1] and defaulting to "server" in
that case.
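The selection logic, including the no-argument case, can be sketched as follows. `select_app` and the tool names passed to it are hypothetical; this is not the code in main():

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns the name of the app to delegate to, or an empty string when the
// first argument is neither "server", a known tool, nor a regular option.
std::string select_app(int argc, const char* const* argv,
                       const std::set<std::string>& tools) {
    // No arguments at all (or an empty first argument): default to the server.
    if (argc < 2 || argv[1] == nullptr || *argv[1] == '\0') {
        return "server";
    }
    const std::string first = argv[1];
    if (first == "server" || tools.count(first) != 0) {
        return first;
    }
    // A regular command-line option means no app was specified explicitly.
    if (first[0] == '-') {
        return "server";
    }
    return "";  // unrecognized first argument
}
```

Checking `argc` before touching `argv[1]` is what keeps the empty command line from falling through to the "unrecognized first argument" error path.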
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220210092125.293682-1-bdenes@scylladb.com>
directory_lister provides a simpler interface
compared to lister.
After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.
This is especially suitable for coroutines
as demonstrated in the unit tests that were added.
For example:
    auto dl = directory_lister(path);
    while (auto de = co_await dl.get()) {
        co_await process(*de);
    }
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a "--raft" option to test/alternator/run to enable the
experimental Raft-based schema changes ("--experimental-features=raft")
when running Scylla for the tests. This is the same option we added to
test/cql-pytest/run in a previous patch.
Note that we still don't have any Alternator tests that pass or fail
differently in these two modes - these will probably come later as we
fix issues #9868 and #6391. But in order to work on fixing those issues
we need to be able to run the tests in Raft mode.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209123144.321344-1-nyh@scylladb.com>
In a previous patch we fixed the output of experimental features list
(issue #10047), so we also need to fix the test code which detects the
"raft" experimental feature - to use the string "raft" and not the
silly byte 4 we had there before.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209104331.312999-1-nyh@scylladb.com>
Issue #8968 no longer exists when Raft-based schema updates are enabled
in Scylla (with --experimental-features=raft). Before we can close this
issue we need a way to re-run its test
test_keyspace.py::test_concurrent_create_and_drop_keyspace
with Raft and see it pass. But we also want the tests to continue to run
by default the older raft-less schema updates - so that this mode doesn't
regress during the potentially-long duration that it's still the default!
The solution in this patch is:
1. Introduce a "--raft" option to test/cql-pytest/run, which runs the tests
against a Scylla with the raft experimental feature, while the default is
still to run without it.
2. Introduce a test fixture "fails_without_raft" which marks a test which
is expected to fail with the old pre-raft code, but is expected to
pass in the new code.
3. Mark the test test_concurrent_create_and_drop_keyspace with this new
"fails_without_raft".
After this patch, running
    test/cql-pytest/run --raft test_keyspace.py::test_concurrent_create_and_drop_keyspace
passes, which shows that issue 8968 was fixed (in Raft mode) - so we can say:
Fixes #8968
Running the same test without "--raft" still xfails (an expected failure).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220208162732.260888-1-nyh@scylladb.com>
The system.config virtual table prints each configuration variable of
type T based on the JSON printer specified in the config_type_for<T>
in db/config.cc.
For two variable types - experimental_features and tri_mode_restriction,
the specified converter was wrong: We used value_to_json<string> or
value_to_json<vector<string>> on something which was *not* a string.
Unfortunately, value_to_json silently cast the given objects into
strings, and the result was garbage: For example as noted in #10047,
for experimental_features instead of printing a list of feature *names*,
e.g., "raft", we got a bizarre list of one-byte strings with each feature's
number (which isn't documented or even guaranteed to not change) as well
as carriage-return characters (!?).
So the solution is a new printable_to_json<T> which works on a type T that
can be printed with operator<< - as in fact the above two types can -
and the type is converted into a string or vector of strings using this
operator<<, not a cast.
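The idea of formatting through operator<< rather than casting can be sketched like this. It is a simplified illustration that produces plain strings, not JSON values; the `feature` enum, its `udf` member and the helper names are assumptions for the example (only the value 4 for raft comes from the description of the bug).

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for an experimental-feature enum.
enum class feature { udf = 1, raft = 4 };

// The type knows how to print itself by name.
std::ostream& operator<<(std::ostream& os, feature f) {
    switch (f) {
    case feature::udf:  return os << "udf";
    case feature::raft: return os << "raft";
    }
    return os << "unknown";
}

// Convert any printable value to a string via operator<<, never via a cast.
template <typename T>
std::string printable_to_string(const T& value) {
    std::ostringstream os;
    os << value;
    return os.str();
}

// Same conversion, applied element by element to a list of values.
template <typename T>
std::vector<std::string> printable_to_strings(const std::vector<T>& values) {
    std::vector<std::string> out;
    out.reserve(values.size());
    for (const auto& v : values) {
        out.push_back(printable_to_string(v));
    }
    return out;
}
```

With this approach `feature::raft` is rendered as "raft" instead of a one-byte string holding the value 4.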
Also added a cql-pytest test for reading system.config and in particular
options of the above two types - checking that they contain sensible
strings and not "garbage" like before this patch.
Fixes #10047.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209090421.298849-1-nyh@scylladb.com>
* seastar 0d250d15a...d27bf8b5a (5):
> Merge "Clean internal namespace in io_queue.cc" from Pavel E
> Making par.._for_each and max_conc.._for_each compatible with move-only views (like generators)
> tests: Perf test for smp::submit_to efficiency
> Merge "Auto-increase IO latency goal from reactor" from Pavel E
> reactor: Fix default task-quota-ms to be 0.5ms
If version is absent in cache, it will be fetched from the
coordinator. This is not expensive, but if the version is not known,
it must be also "synced". It means that the node will do a full schema
pull from the coordinator. This pull is expensive and can take seconds.
If the coordinator we pull from is at an old version, the pull will do
nothing and current node will soon forget the old version, initiating
another pull.
If some nodes stay at an old version for a long time for some reason,
this will make new coordinators initiate pulls frequently.
Increase the expiration period to 15 minutes to reduce the impact in
such scenarios.
Fixes #10042.
Message-Id: <20220207122317.674241-1-tgrabiec@scylladb.com>
Make only the first node in group0 start as a voter. Subsequent nodes
start as non-voters and request change to voter once bootstrap is
successful.
Add support for this in raft and a couple of minor fixes.
* alejo/raft-join-non-voting-v6:
raft: nodes joining as non-voters
raft: group 0: use cfg.contains() for config check
raft: modify_config: support voting state change
raft: minor: fix log format string
With trigger_compaction() being called after each new sstable is added
to the set, we'll get quadratic behavior because strategies like
tiered will sort all the candidates before iterating on them, so
complexity is ~ ((N - 1) * N * logN).
Additionally, compaction may be inefficient as we're not waiting for
the sstable set to settle, so table may end up missing files that
would allow for more efficient jobs.
The latter isn't a big problem because we have reshape running in an
earlier phase, so the data layout should almost satisfy the strategy.
Boot is not affected by these problems because it temporarily
disables auto compaction, so trigger_compaction() is a no-op for it.
So refresh remains as the only one affected.
Fixes #10046.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220208151154.72606-1-raphaelsc@scylladb.com>
Except for the first node creating the group0, make other nodes join as
non-voters and make them voters after successful bootstrap.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Currently, most of the failures that occur during CQL reads or writes are reported using C++ exceptions. Although the seastar framework avoids most of the cost of unwinding by keeping exceptions in futures as `std::exception_ptr`s, the exceptions need to be inspected at various points for the purposes of accounting metrics or converting them to a CQL error response. Analyzing the value and type of an exception held by a `std::exception_ptr` cannot be done without rethrowing the exception, and that can be very costly even if the exception is immediately caught. Because of that, exceptions are not a good fit for reporting failures which happen frequently during overload, especially if the CPU is the bottleneck.
This PR introduces facilities for reporting exceptions as values using the boost::outcome library. As a first step, the need to use exceptions for reporting timeouts was eliminated for regular and batch writes, and no exceptions are thrown between creation of a `mutation_write_timeout_exception` and its serialization as a CQL response in the `cql_server`.
The types and helpers introduced here can be reused in order to migrate more exceptions and exception paths in a similar fashion.
Results of `perf_simple_query --smp 1 --operations-per-shard 1000000`:
Master (00a9326ae7)
128789.53 tps ( 82.2 allocs/op, 12.2 tasks/op, 49245 insns/op)
This PR
127072.93 tps ( 82.2 allocs/op, 12.2 tasks/op, 49356 insns/op)
The new version seems to be slower by about 100 insns/op, fortunately not by much (about 0.2%).
Tests: unit(dev), unit(result_utils_test, debug)
Closes #10014
* github.com:scylladb/scylla:
cql_test_env: optimize handling result_message::exception
transport/server: handle exceptions from coordinator_result without throwing
transport/server: propagate coordinator_result to the error handling code
transport/server: unwrap the exception result_message in process_xyz_internal
query_processor: add exception-returning variants of execute_ methods
modification_statement: propagate failed result through result_message::exception
batch_statement: propagate failed result through result_message::exception
cql_statement: add `execute_without_checking_exception_message`
result_message: add result_message::exception
storage_proxy: change mutate_with_triggers to return future<result<>>
storage_proxy: add mutate_atomically_result
storage_proxy: return result<> from mutate_result
storage_proxy: return result<> from mutate_internal
storage_proxy: properly propagate future from mutate_begin to mutate_end
storage_proxy: handle exceptions as values in mutate_end
storage_proxy: let mutate_end take a future<result<>>
storage_proxy: resultify mutate_begin
storage_proxy: use result in the _ready future of write handlers
storage_proxy: introduce helpers for dealing with results
exceptions: add coordinator_exception_container and coordinator_result
utils: add result utils
utils: add exception_container
There will be nodes in non-voting state in configuration, so can_vote()
is not a good check. Use newer cfg.contains().
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The single_node_cql_env uses query_processor::execute_xyz family of
methods to perform operations. Due to previous commits in this series,
they allocate one more task than before - a continuation that converts
result_message::exception into an exceptional future. We can recover
that one task by using variants of those methods which do not perform a
conversion, and turn .finally() invocations into .then()s which perform
conversion manually.
At the point where `result_message` is converted to a
`cql_server::response`, now the result message is inspected and returned
as failed `result<>` if it contained an error.
For now, the failed `result<>` is thrown as exception in `process` and
`process_on_shard`, but that will change in the next commit.
Adds variants of the execute_prepared, execute_direct and execute_batch
which are allowed to return exceptions as `result_message::exception`.
Because the `result_message::exception` must be explicitly handled by
the receiver, new variants are introduced in order not to accidentally
ignore the exception, which would be very bad.
Modifies the modification_statement code so that it converts failed
`result<>` into a `result_message::exception` without involving the C++
exception runtime.
Modifies the batch_statement code so that it converts failed `result<>`
into a `result_message::exception` without involving the C++ exception
runtime.
Adds a new virtual method to the cql_statement with a wordy name. The
new method is a variant of `execute`, but it is allowed to return errors
via the `result_message::exception` object.
The reason for an additional method is that there are many places in the
code which call `execute` but do not check the result in any way.
Because ignoring an exception unintentionally is a very bad thing, the
new method needs to be explicitly implemented by statements which can
return a `result_message::exception`, and explicitly called in the code
which is prepared to handle a `result_message::exception`.
In order to propagate exceptions as values through the CQL layer with
minimal modifications to the interfaces, a new result_message type is
introduced: result_message::exception. Similarly to
result_message::bounce_to_shard, this is an internal type which is
supposed to be handled before being returned to the client.
Changes the interface of `mutate_with_triggers` so that it returns
`future<result<>>` instead of `future<>`. No intermediate
`mutate_with_triggers_result` method is introduced because all call
sites will be changed in this PR so that they properly handle failed
`result<>`s with exceptions-as-values.
Similarly to `mutate_result` introduced in the previous commit,
`mutate_atomically_result` is introduced which returns some exceptions
inside `result<>`. The pre-existing `mutate_atomically` keeps the same
interface but uses `mutate_atomically_result` internally, converting
failed `result<>` to exceptional future if needed.
In order to be able to propagate exceptions-as-values from storage_proxy
but without having to modify all call sites of `mutate`, an in-between
method `mutate_result` is introduced which returns some exceptions
inside `result<>`. Now, `mutate` just calls the latter and converts
those exceptions to exceptional future if needed.
Instead of stupidly rethrowing the exception in failed result<>, the
`storage_proxy::mutate_end` function now inspects it with a visitor,
which does not involve any rethrows. Moreover, mutate_end now also
returns a `future<result<>>` instead of just `future<>`.
Changes the `storage_proxy::mutate_end` method to accept a
`future<result<>>` instead of `future<>`.
For the time being, all call sites of that method pass a future
which is either exceptional or contains a result<> with a value.
Moreover, in case of a failed result<>, mutate_end just rethrows the
exception. Both of these will change in the upcoming commits of this PR.
Changes the type of the _ready promise in
abstract_write_response_handler - a promise used by the coordinator
logic to wait until the write operation is complete - to keep a
`result<>` instead of `void`. Now, a timeout is signalled by setting the
promise to a value containing a `result<>` with a mutation write timeout
exception - previously it was signalled by setting the promise to an
exceptional value.
This is just a first step on a long road of throwless propagation of the
error to the cql_server - for now, a failed result is immediately
converted to an exceptional future in `storage_proxy::response_wait`.
Adds coordinator_exception_container which is a typedef over
exception_container and is meant to hold exceptions returned from the
coordinator code path. Currently, it can only hold mutation write
timeout exceptions, because only that kind of error will be returned by
value as a result of this PR. In the future, more exception types can be
added.
Adds coordinator_result which is a boost::outcome::result that uses
coordinator_exception_container as the error type.
Adds a number of utilities for working with boost::outcome::result
combined with exception_container. The utilities are meant to help with
migration of the existing code to use the boost::outcome::result:
- `exception_container_throw_policy` - a NoValuePolicy meant to be used
as a template parameter for the boost::outcome::result. It protects
the caller of `result::value()` and `result::error()` methods - if the
caller wishes to get a value but the result has an error
(exception_container in our case), the exception in the container will
be thrown instead. In case it's the other way around,
boost::outcome::bad_result_access is thrown.
- `result_parallel_for_each` - a version of `parallel_for_each` which is
aware of results and returns a failed result in case any of the
parallel invocations return a failed result.
- `result_into_future` - converts a result into a future. If the result
holds a value, converts it into make_ready_future; if it holds an
exception, the exception is returned as make_exception_future.
- `then_ok_result` takes a `future<T>` and converts it into
a `future<result<T>>`.
- `result_wrap` adapts a callable of type `T -> future<result<T>>` and
returns a callable of type `result<T> -> future<result<T>>`.
"
This is the continuation of 3e31126b (Brush up the initial tokens
generation code). The replica::database is still used as the
configuration provider, and two of those bits can be easily fixed.
"
tests: unit(dev)
* 'br-database-no-replacing-config' of https://github.com/xemul/scylla:
database: Move is_replacing() and get_replace_address() (back) into storage_service
bootstrapper: Get 'is-replacing' via argument too
bootstrapper: Get replace address via argument
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.
So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.
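The resulting rule can be illustrated with a toy model. This is not Alternator code: it assumes, for simplicity, that every top-level attribute is a map of strings, and `remove_nested` is a hypothetical helper.

```cpp
#include <map>
#include <stdexcept>
#include <string>

using attr_map = std::map<std::string, std::string>;
// Toy item: top-level attribute name -> nested map (a simplifying assumption).
using item = std::map<std::string, attr_map>;

// Models "REMOVE x.y". Returns true if something was actually removed.
bool remove_nested(item& it, const std::string& x, const std::string& y) {
    auto outer = it.find(x);
    if (outer == it.end()) {
        // "x" itself is missing: this remains an error.
        throw std::invalid_argument("document paths not valid for this item");
    }
    // "x" exists: erasing a missing "y" is a silent no-op.
    return outer->second.erase(y) > 0;
}
```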
Fixes #10043.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
1. There's nothing we can do about this error.
2. It doesn't affect any query.
3. No need to report timeout errors here.
Refs #10029
Note that in 4.6.rc4-0.20220203.34d470967a0 (where the issue above was opened against)
the error is likely to be related to read_ahead failure which
is already reported as a warning in master since fc729a804b.
When backported, this patch should be applied after:
fc729a804bd7a993043d
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220207080041.174934-1-bhalevy@scylladb.com>
Leaving group 0 in `decommission` would previously fail with RPC
exception because it happened after messaging service was shutdown.
Fixes #9845.
Message-Id: <20220201112743.9705-1-kbraun@scylladb.com>
As observed in #10026, after schema changes it somehow happened
that a column definition that does not match any of the base table
columns was passed to expression verification code.
The function that looks up the index of a column happens to return
-1 when it doesn't find anything, so using this returned index
without checking if it's nonnegative results in accessing invalid
vector data, and a segfault or silent memory corruption.
Therefore, an explicit check is added to see if the column was actually
found. This serves two purposes:
- avoiding segfaults/memory corruption
- making it easier to investigate the root cause of #10026
Closes #10039
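The kind of check described above looks roughly like the following. The function names and the column/type vectors are made up for the example and do not correspond to the actual expression verification code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mimics a lookup that signals "not found" by returning -1.
int index_of(const std::vector<std::string>& columns, const std::string& name) {
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (columns[i] == name) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Hypothetical checked accessor: returns nullptr instead of indexing a
// vector with -1 (which would be undefined behavior).
const std::string* find_type(const std::vector<std::string>& columns,
                             const std::vector<std::string>& types,
                             const std::string& name) {
    const int idx = index_of(columns, name);
    if (idx < 0 || static_cast<std::size_t>(idx) >= types.size()) {
        return nullptr;
    }
    return &types[static_cast<std::size_t>(idx)];
}
```

A caller can then report a clear error for the unexpected column instead of crashing.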
The CQL parser currently accepts a command like:
ALTER KEYSPACE ksname WITH replication = {
'class' : 'NetworkTopologyStrategy',
'dc1' : 2,
'dc1' : 3 }
But because these options are read into an std::map, one of the
definitions of 'dc1' is silently ignored (counter-intuitively, it is
the first setting which is kept, and the second setting is ignored.)
But this is most likely a user's typo, so a better choice is to report
this as a parse error instead of arbitrarily and silently keeping just
one of the settings.
This is what Cassandra does since version 3.11 (see
https://issues.apache.org/jira/browse/CASSANDRA-13369 and Cassandra
commit 1a83efe2047d0138725d5e102cc40774f3b14641), and this is what we do
in this patch.
The unit test cassandra_tests/validation/operations/alter_test.py::
testAlterKeyspaceWithMultipleInstancesOfSameDCThrowsSyntaxException,
translated from Cassandra's unit tests, now passes.
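The difference between the old and new behavior can be shown with a small standalone sketch. It is not the parser code; `to_map_silent`, `to_map_strict` and the error text are assumptions for illustration.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using option_list = std::vector<std::pair<std::string, std::string>>;
using option_map = std::map<std::string, std::string>;

// Old behavior: emplace() does nothing when the key already exists,
// so the first value silently wins.
option_map to_map_silent(const option_list& opts) {
    option_map m;
    for (const auto& [key, value] : opts) {
        m.emplace(key, value);
    }
    return m;
}

// New behavior: a repeated key is reported as an error.
option_map to_map_strict(const option_list& opts) {
    option_map m;
    for (const auto& [key, value] : opts) {
        const bool inserted = m.emplace(key, value).second;
        if (!inserted) {
            throw std::invalid_argument("duplicate option: " + key);
        }
    }
    return m;
}
```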
Fixes #10037.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207113709.78613-1-nyh@scylladb.com>
Both helpers (naturally) used to be storage-service methods, but then
were moved to database because bootstrapper code wanted to know this info.
Now the bootstrapper is equipped with necessary arguments.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This also removes the only usage of this helper outside of the storage
service. The place that needs it is the use_strict_sources_for_ranges()
checker and all the callers of it are aware of whether replacing is
happening or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The function cql3::util::maybe_quote() is used throughout Scylla to
convert identifier names (column names, table names, etc.) into strings
that can be embedded in CQL commands. maybe_quote() sometimes needs to
quote these identifier names, but when the identifier name is lowercase,
and not a CQL keyword, it is not quoted.
Not quoting identifier names when not needed is nice and pretty, but has
a forward-compatibility problem: If some CQL command with an unquoted
identifier is saved somewhere, and a new version of Scylla adds this
identifier as a new reserved keyword - the CQL command will break.
So this patch introduces a new function, cql3::util::quote(), which
unconditionally quotes the given identifier.
The new function is not yet used in Scylla, but we add a unit test
(based on the test of maybe_quote()) to confirm it behaves correctly.
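Unconditional quoting is simple enough to sketch in a few lines. The helper below is only an illustration under the name `quote_identifier` (not the actual cql3::util::quote() code); it relies on the CQL rule that a double quote inside a quoted identifier is escaped by doubling it.

```cpp
#include <string>
#include <string_view>

// Always wrap the identifier in double quotes, doubling any embedded
// double-quote characters.
std::string quote_identifier(std::string_view identifier) {
    std::string out;
    out.reserve(identifier.size() + 2);
    out.push_back('"');
    for (const char c : identifier) {
        if (c == '"') {
            out.push_back('"');  // escape by doubling
        }
        out.push_back(c);
    }
    out.push_back('"');
    return out;
}
```

Because the result never depends on the keyword list, a saved statement stays parseable even if a later version reserves the identifier.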
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-2-nyh@scylladb.com>
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.
maybe_quote() made an effort to only quote the identifier name if necessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., to or where - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.
So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.
This patch also adds two tests that reproduce the bug and verify its
fix:
1. Extend the low-level maybe_quote() test (a C++ unit test) to check
that maybe_quote() quotes reserved keywords like "to", but doesn't
quote unreserved keywords like "int".
2. Add a test reproducing issue #9450 - creating a materialized view
whose key column is a keyword. This new test passes on Cassandra,
failed on Scylla before this patch, and passes after this patch.
It is worth noting that maybe_quote() now has a "forward compatibility"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old maybe_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:
1. Try hard not to introduce new reserved keywords. Instead, introduce
unreserved keywords. We've been doing this even before recognizing
this maybe_quote() future-compatibility problem.
2. In the next patch we will introduce quote() - which unconditionally
quotes identifier names, even if lowercase. These quoted names will
be uglier for lowercase names - but will be safe from future
introduction of new keywords. So we can consider switching some or
all uses of maybe_quote() to quote().
Fixes #9450
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/operations/AlterTest.java into our cql-pytest framework.
This test file includes 24 tests for various types of ALTER operations
(of keyspaces, tables and types). Two additional tests which required
multiple data centers to test were dropped with a comment explaining why.
All 24 tests pass on Cassandra, with 8 failing on Scylla reproducing
one already known Scylla issue and 5 previously-unknown ones:
Refs #8948: Cassandra 3.11.10 uses "class" instead of
"sstable_compression" for compression settings by default
Refs #9929: Cassandra added "USING TIMESTAMP" to "ALTER TABLE",
we didn't.
Refs #9930: Forbid re-adding static columns as regular and vice versa
Refs #9935: Scylla stores un-expanded compaction class name in system
tables.
Refs #10036: Reject empty options while altering a keyspace
Refs #10037: If there are multiple values for a key, CQL silently
chooses last value
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220206163820.1875410-2-nyh@scylladb.com>
Implement the nodetool.compact() function, requesting a major compaction
of the given table. As usual for the nodetool.* functions, this is
implemented with the REST API if available (i.e., testing Scylla), or
with the external "nodetool" command if not (for testing Cassandra).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220206163820.1875410-1-nyh@scylladb.com>
Seastar uses POSIX IO for output in addition to C++ iostreams,
e.g. in print_safe(), where it write()s directly to stdout.
Instead of manipulating C++ output streams to reset
stdout/log files, reopen the underlying file descriptors
to output/log files.
Fixes #9962 "cql_repl prints junk into the log"
Message-Id: <20220204205032.1313150-1-kostja@scylladb.com>
Merged patch series by Konstantin Osipov:
Assorted fixes in test.py in preparation for cluster
testing:
- better logging
- async search for unit test cases
- ubuntu fixes
test.py: highlight the failure cause
test.py: clean up setting of scylla executable
test.py: speed up search for tests cases, use async
test.py: make case cache global
test.py: make --cpus option work on Ubuntu
test.py: create an own TestSuite instance for each path/mode combo
test.py: do not fail entire run if list-content fails due to ASAN
test.py: print subtest name on cancel
test.py: fix flake8 complaints
Since these two functions call each other, convert
to coroutines and eliminate the dependency on `seastar::async`
for both of them at the same time.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `real_mark_alive` does not require `seastar::async`
now, we can eliminate the wrapping async call, as well.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Adds `exception_container` - a helper type used to hold exceptions as a
value, without involving the std::exception_ptr.
The motivation behind this type is that it allows inspecting exception's
type and value without having to rethrow that exception and catch it,
unlike std::exception_ptr. In our current codebase, some exception
handling paths need to rethrow the exception multiple times in order to
account it into metrics or encode it as an error response to the CQL
client. Some types of exceptions can be thrown very frequently in case
of overload (e.g. timeouts) and inspecting those exceptions with
rethrows can make the overload even worse. For those kinds of exceptions
it is important to handle them as cheaply as possible, and
exception_container used in conjunction with boost::outcome::result
can help achieve that.
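The underlying idea, inspecting an error's type and value without a throw/catch round trip, can be sketched with the standard library alone. This is not the `exception_container` implementation: it substitutes a plain std::variant, and the error types, their fields and `describe` are hypothetical.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical error types held by value.
struct write_timeout { int received; int block_for; };
struct unavailable   { int required; int alive; };

// Stand-in for an exception container: a closed set of error types.
using error_container = std::variant<write_timeout, unavailable>;

// Inspect the stored error with a visitor. Nothing is thrown or caught,
// unlike inspecting a std::exception_ptr.
std::string describe(const error_container& e) {
    return std::visit([](const auto& ex) -> std::string {
        using T = std::decay_t<decltype(ex)>;
        if constexpr (std::is_same_v<T, write_timeout>) {
            return "write timeout: received " + std::to_string(ex.received) +
                   " of " + std::to_string(ex.block_for);
        } else {
            return "unavailable: " + std::to_string(ex.alive) +
                   " alive of " + std::to_string(ex.required) + " required";
        }
    }, e);
}
```

The same visitor style can feed metrics or build an error response, which is where repeated rethrows would otherwise be paid for.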
Search for test cases in parallel.
This speeds up the search for test cases from 30 to 4-5
seconds in absence of test case cache and from 4 to 3
seconds if case cache is present.
test.py runs each unit test's test case in a separate process.
The list of test cases is built at start, by running --list-cases
for each unit test. The output is cached, so that if one uses --repeat
option, we don't list the cases again and again.
The cache, however, was only useful for --repeat, because it was only
caching the last test's output, not all tests' output, so if I, for example,
run tests like:
./test.py foo bar foo
.. the cache was unused. Make the cache global which simplifies its
logic and makes it work in more cases.
To run tests in a given mode we will need to start off scylla
clusters, which we would want to pool and reuse between many tests.
TestSuite class was designed to share resources of common tests.
One can't pool together scylla servers compiled in different
modes, so create a separate TestSuite instance for each mode.
It's good practice to use linters and style formatters for
all scripted languages. Python community is more strict
about formatting guidelines than others, and using
formatters (like flake8 or black) is almost universally
accepted.
test.py was adhering to flake8 standards at some point,
but later this was spoiled by random commits.
An OOM failure while peeking into fragment, to determine if reader will
produce any fragment, causes Scylla to abort as flat_mutation_reader
expects the reader to be closed before it is destroyed. Let's close it if
peek() fails, to handle the scenario more gracefully.
Fixes #10027.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220204031553.124848-1-raphaelsc@scylladb.com>
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on column family interface to scanning every datadir
that the database knows of in search for "snapshots" folders.
This PR is a rebased version of #9539 (and slightly cleaned-up, cosmetically)
and so it replaces the previous PR.
Fixes #3463
Closes #7122
Closes #9884
* github.com:scylladb/scylla:
snapshots: Fix snapshot-ctl to include snapshots of dropped tables
table: snapshot: add debug messages
Following the advice in the FIXME note, helper functions for parsing
expressions are now based on string views to avoid a few unnecessary
conversions to std::string.
Tests: unit(dev)
Closes #10013
In commit d72465531e we fixed the building
of relocatable packages of submodules (tools/java, etc.) to use the
top-level Scylla's version. However, if on an active working directory
Scylla's version changes - as we just did from 4.7 to 5.0 - these
relocatable packages are not rebuilt with the new version number, and as
a result some of our scripts (such as the docker build) can't find them.
Because the build-submodule-reloc rule depends on the files
build/SCYLLA-{PRODUCT,VERSION,RELEASE}-FILE (which is what the
aforementioned commit did), in this patch we add those files as a
dependency whenever build-submodule-reloc is used. This means that if
any of these files change, we rebuild the relocatable packages and
anything depending on them (e.g., Debian packages).
Fixes #10018.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202131248.1610678-1-nyh@scylladb.com>
"
which is currently unhandled from multiple call sites, leading to the following warning
as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/1094/artifact/logs-all.release.2/1643794928169_materialized_views_test.py%3A%3ATestInterruptBuildProcess%3A%3Atest_interrupt_build_process_and_resharding_half_to_max_test/node2.log
```
Scylla version 5.0.dev-0.20220201.a026b4ef4 with build-id cebf6dca8edd8df843a07e0f01a1573f1d0a6dfc starting ...
WARN 2022-02-02 09:31:56,616 [shard 2] seastar - Exceptional future ignored: seastar::sleep_aborted (Sleep is aborted), backtrace: 0x463b65e 0x463bb50 0x463be58 0x426c165 0x230c744 0x42adad4 0x42aeea7 0x42cdb55 0x4281a2a /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libpthread.so.0+0x9298 /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libc.so.6+0x100352
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false> >(seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
```
Decoded:
```
void seastar::backtrace(seastar::current_backtrace_tasklocal()::$_3&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:86
seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:137
seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:170
seastar::report_failed_future(std::__exception_ptr::exception_ptr const&) at ./build/release/seastar/./seastar/src/core/future.cc:210
(inlined by) seastar::report_failed_future(seastar::future_state_base::any&&) at ./build/release/seastar/./seastar/src/core/future.cc:218
seastar::future_state_base::any::check_failure() at ././seastar/include/seastar/core/future.hh:567
(inlined by) seastar::future_state::clear() at ././seastar/include/seastar/core/future.hh:609
(inlined by) ~future_state at ././seastar/include/seastar/core/future.hh:614
(inlined by) ~future at ././seastar/include/seastar/core/scheduling.hh:43
(inlined by) void seastar::futurize >::satisfy_with_result_of::then_wrapped_nrvo, seastar::future::finally_body >(seastar::future::finally_body&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}::operator()(seastar::internal::promise_base_with_type, seastar::internal::promise_base_with_type&&, seastar::future_state::finally_body&&::monostate>) const::{lambda()#1}>(seastar::internal::promise_base_with_type, seastar::future::finally_body&&) at ././seastar/include/seastar/core/future.hh:2120
(inlined by) operator() at ././seastar/include/seastar/core/future.hh:1667
(inlined by) seastar::continuation, seastar::future::finally_body, seastar::future::then_wrapped_nrvo, serialized_action::trigger(bool)::{lambda()#2}>(serialized_action::trigger(bool)::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2344
(inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2754
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2923
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4128
(inlined by) void std::__invoke_impl(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:61
(inlined by) std::enable_if, void>::type std::__invoke_r(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:111
(inlined by) std::_Function_handler::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:291
std::function::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:560
(inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60
```
This series adds exception handling to the serialized_action trigger
call sites that do not handle exceptions.
Test: unit(dev)
"
* tag 'handle-serialized_action-trigger-exception-v1' of https://github.com/bhalevy/scylla:
migration_manager: passive_announce(version): handle exception
view_builder: do_build_step: handle unexpected exceptions
storage_service: no need to include utils/serialized_action.hh
Exceptions are handled by do_build_step in principle.
Yet if an unhandled exception escapes that handling
(e.g. get_units(_sem, 1) fails on a broken semaphore)
we should warn about it, since the _build_step.trigger() calls
do not handle exceptions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
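As a rough illustration of the idea in the commit above (warn about an exception rather than leave it unobserved), here is a minimal sketch in plain C++. It is an analogy only and does not use Seastar futures; `run_and_warn` and its parameters are hypothetical names, not part of Scylla.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Plain C++ analogy, not Seastar code: run a fire-and-forget action and
// report any exception as a warning instead of letting it go unobserved.
// `run_and_warn` is a hypothetical helper name.
inline bool run_and_warn(const std::function<void()>& action,
                         const std::string& what,
                         std::ostream& log) {
    assert(static_cast<bool>(action)); // the action must be callable
    try {
        action();
        return true;
    } catch (const std::exception& e) {
        log << "WARN " << what << " failed: " << e.what() << '\n';
    } catch (...) {
        log << "WARN " << what << " failed: unknown exception\n";
    }
    return false;
}
```

A caller would wrap each trigger-like call in `run_and_warn`, so that a failure such as a broken semaphore shows up as an explicit warning at the call site.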
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on column family interface to scanning every datadir
that the database knows of in search for "snapshots" folders.
Fixes #3463
Closes #7122
Closes #9884
Signed-off-by: Piotr Wojtczak <piotr.m.wojtczak@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
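The directory scan described in the commit above can be sketched with `std::filesystem`. This is a synchronous, simplified illustration under stated assumptions: the real code is Seastar-based and asynchronous, and `find_snapshot_dirs` is a hypothetical helper name, not a Scylla function.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: return every directory named "snapshots" found
// under the given data directories. Missing data directories are skipped.
inline std::vector<fs::path> find_snapshot_dirs(const std::vector<fs::path>& datadirs) {
    std::vector<fs::path> found;
    for (const auto& dir : datadirs) {
        std::error_code ec;
        if (!fs::is_directory(dir, ec)) {
            continue; // not present (or not a directory): nothing to scan
        }
        fs::recursive_directory_iterator it(
            dir, fs::directory_options::skip_permission_denied, ec);
        const fs::recursive_directory_iterator end;
        for (; it != end; it.increment(ec)) {
            if (ec) {
                break; // give up on this data directory on I/O errors
            }
            if (it->is_directory(ec) && it->path().filename() == "snapshots") {
                found.push_back(it->path());
                it.disable_recursion_pending(); // do not descend into the snapshot itself
            }
        }
    }
    std::sort(found.begin(), found.end());
    assert(std::is_sorted(found.begin(), found.end()));
    return found;
}
```

Because the scan depends only on what is on disk, it still finds snapshots of tables whose in-memory objects are already gone, which is the situation the commit addresses.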
Added test that checks if a SELECT COUNT(*) query was transformed and
processed in a parallel way. Checking is done by looking at the cql
statistics and comparing subsequent counts of parallelized aggregation
SELECT query executions.
Coordinators processed each vnode sequentially on shards when executing
a `forward_request` sent by super-coordinator. This commit changes this
behavior and parallelizes execution of `forward_request` across shards.
It does that by adding an additional layer of dispatching to
`forward_service`. When a coordinator receives a `forward_request`, it
forwards it to each of its shards. Shards slice `forward_request`'s
partition ranges so that they will only query data that is owned by
them. Implementation of slicing partition ranges was based on @nyh's
`token_ranges_owned_by_this_shard` from `alternator/ttl.cc`.
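The per-shard slicing described above can be pictured as intersecting each requested range with the tokens a shard owns. The sketch below assumes a toy sharding scheme (each shard owns one contiguous slice of a small integer token space); Scylla's real sharder is more involved, and `token_range`, `owned_by_shard` and `slice_for_shard` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open token interval [start, end).
struct token_range {
    std::uint64_t start;
    std::uint64_t end;
};

// ASSUMPTION: a toy sharding scheme in which shard `shard` of `shard_count`
// owns one contiguous slice of the token space [0, space).
// This is not Scylla's actual token-to-shard mapping.
inline token_range owned_by_shard(unsigned shard, unsigned shard_count, std::uint64_t space) {
    assert(shard_count > 0 && shard < shard_count);
    const std::uint64_t width = space / shard_count;
    const std::uint64_t start = width * shard;
    // The last shard also takes the remainder of the token space.
    const std::uint64_t end = (shard + 1 == shard_count) ? space : width * (shard + 1);
    return {start, end};
}

// Keep only the parts of `ranges` that fall inside the slice owned by `shard`.
inline std::vector<token_range> slice_for_shard(const std::vector<token_range>& ranges,
                                                unsigned shard, unsigned shard_count,
                                                std::uint64_t space) {
    const token_range owned = owned_by_shard(shard, shard_count, space);
    std::vector<token_range> out;
    for (const auto& r : ranges) {
        const std::uint64_t lo = std::max(r.start, owned.start);
        const std::uint64_t hi = std::min(r.end, owned.end);
        if (lo < hi) {
            out.push_back({lo, hi}); // non-empty intersection
        }
    }
    return out;
}
```

With every shard applying the same slicing to the same request, the slices are disjoint and together cover the original ranges, so no row is counted twice or missed.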
Detect whether a statement is a count(*) query in prepare time. If so,
instantiate a new `select_statement` subclass -
`parallelized_select_statement`. This subclass has a different execution
logic, that enables it to distribute count(*) queries across a cluster.
Also, a new counter was added - `select_parallelized` that counts the
number of parallelized aggregation SELECT query executions.
The new service is responsible for:
* spreading forward_request execution across multiple nodes in the cluster
* collecting forward_request execution results and merging them
`forward_service::dispatch` method takes forward_request as an
argument, and forwards its execution to a group of other nodes (using rpc
verb added in previous commits). Each node (in the group chosen by
dispatch method) is provided with forward_request, which is no different
from the original argument except for changed partition ranges. They are
changed so that vnodes contained in them are owned by recipient node.
Executing forward_request is realized in `forward_service::execute`
method, that is registered to be called on FORWARD_REQUEST verb receipt.
The process of executing forward_request consists of mocking a few
non-serializable objects (such as `cql3::selection`) in order to create
`service::pager::query_pagers::pager` and `cql3::selection::result_set_builder`.
After pager and result_set_builder creation, execution process resembles
what might be seen in select_statement's execution path.
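The dispatch-and-merge flow described above can be sketched as a scatter/gather over owning nodes. This is a synchronous simplification under stated assumptions: the real service uses an RPC verb and Seastar futures, and every name below (`forward_request_sketch`, `dispatch_count`, `owner_fn`, `execute_fn`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Half-open vnode token interval [start, end).
struct vnode_range {
    std::uint64_t start;
    std::uint64_t end;
};

// Hypothetical, heavily simplified request: only the partition ranges matter here.
struct forward_request_sketch {
    std::vector<vnode_range> ranges;
};

using node_id = std::string;
using owner_fn = std::function<node_id(const vnode_range&)>;
using execute_fn = std::function<std::uint64_t(const node_id&, const forward_request_sketch&)>;

// Split the request by owning node, run each piece, and merge (sum) the partial counts.
inline std::uint64_t dispatch_count(const forward_request_sketch& req,
                                    const owner_fn& owner_of,
                                    const execute_fn& execute_on) {
    assert(owner_of && execute_on); // both callbacks must be set
    // Each node receives a copy of the request restricted to the ranges it owns.
    std::map<node_id, forward_request_sketch> per_node;
    for (const auto& r : req.ranges) {
        per_node[owner_of(r)].ranges.push_back(r);
    }
    std::uint64_t total = 0;
    for (const auto& [node, sub] : per_node) {
        total += execute_on(node, sub); // stands in for the RPC round trip
    }
    return total;
}
```

Summation is the merge step for count(*); other aggregates would need their own merge function, which is why the result is kept separate from the per-node execution.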
Except for the verb addition, this commit also defines forward_request
and forward_result structures, used as an argument and result of the new
rpc. forward_request is used to forward information about select
statement that does count(*) (or other aggregating functions such as
max, min, avg in the future). Due to the inability to serialize
cql3::statements::select_statement, I chose to include
query::read_command, dht::partition_range_vector and some configuration
options in forward_request. They can be serialized and are sufficient
to allow creation of service::pager::query_pagers::pager.
The way that this detection works is a bit clunky, but it does its job
given the simplest cases e.g. "SELECT COUNT(*) FROM ks.t". It fails when
there are multiple selectors, or when there is a column name specified
("SELECT COUNT(column_name) FROM ks.t").
2022-02-01 21:14:41 +01:00
2354 changed files with 145602 additions and 44458 deletions
@@ -18,3 +18,5 @@ If you need help formatting or sending patches, [check out these instructions](h
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
Header files in Scylla must be self-contained, i.e., each can be included without having to include specific other headers first. To verify that your change did not break this property, run `ninja dev-headers`. If you added or removed header files, you must `touch configure.py` first - this will cause `configure.py` to be automatically re-run to generate a fresh list of header files.
For more criteria on what reviewers consider good code, see the [review checklist](https://github.com/scylladb/scylla/blob/master/docs/dev/review-checklist.md).
@@ -383,6 +383,40 @@ Open the link printed at the end. Be horrified. Go and write more tests.
For more details see `./scripts/coverage.py --help`.
### Resolving stack backtraces
Scylla may print stack backtraces to the log for several reasons.
For example:
- When aborting (e.g. due to assertion failure, internal error, or segfault)
- When detecting seastar reactor stalls (where a seastar task runs for a long time without yielding the cpu to other tasks on that shard)
The backtraces contain code pointers so they are not very helpful without resolving into code locations.
To resolve the backtraces, one needs the scylla relocatable package that contains the scylla binary (with debug information),
as well as the dynamic libraries it is linked against.
Builds from our automated build system are uploaded to the cloud
and can be searched on http://backtrace.scylladb.com/
Make sure you have the scylla server exact `build-id` to locate
its respective relocatable package, required for decoding backtraces it prints.
The build-id is printed to the system log when scylla starts.
It can also be found by executing `scylla --build-id`, or
by using the `file` utility, for example:
```
$ scylla --build-id
4cba12e6eb290a406bfa4930918db23941fd4be3
$ file scylla
scylla: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=4cba12e6eb290a406bfa4930918db23941fd4be3, with debug_info, not stripped, too many notes (256)
```
To find the build-id of a coredump, use the `eu-unstrip` utility as follows:
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
co_return api_error::unknown_operation("DescribeTimeToLive not yet supported. Experimental support is available if the 'alternator_ttl' experimental feature is enabled on all nodes.");
"description":"Comma seperated keyspaces name to snapshot",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -632,7 +632,7 @@
},
{
"name":"cf",
"description":"the column family to snapshot",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -667,7 +667,7 @@
},
{
"name":"kn",
"description":"Commaseperated keyspaces name that their snapshot will be deleted",
"description":"Comma-separated keyspaces name that their snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -723,7 +723,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -755,7 +755,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -787,7 +787,7 @@
},
{
"name":"cf",
"description":"Comma-seperated table names",
"description":"Comma-separated table names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -862,7 +862,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -902,7 +902,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -934,7 +934,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1946,7 +1946,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset local schema",
"summary":"Forces this node to recalculate versions of schema objects.",
"type":"void",
"nickname":"reset_local_schema",
"produces":[
@@ -2073,7 +2073,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -2100,7 +2100,7 @@
},
{
"name":"cf",
"description":"Commaseperated column family names",
"description":"Comma-separated column family names",
log_debug("Replacing earlier exhausted sstable(s) {} by new sstable {}",formatted_sstables_list(exhausted_ssts,false),sst->get_filename());
log_debug("Replacing earlier exhausted sstable(s) {} by new sstable(s) {}",formatted_sstables_list(exhausted_ssts,false),formatted_sstables_list(_new_unused_sstables,true));
"scrub compaction cannot handle invalid fragments with an active range tombstone change");
}
// If the unexpected fragment is a partition end, we just drop it.
// The only case a partition end is invalid is when it comes after
// another partition end, and we can just drop it in that case.
@@ -1312,6 +1343,7 @@ private:
}
void fill_buffer_from_underlying() {
utils::get_local_injector().inject("rest_api_keyspace_scrub_abort", [] { throw compaction_aborted_exception("", "", "scrub compaction found invalid data"); });
leveled_manifest::logger.warn("Found SSTable with level {}, higher than the maximum {}. This is unexpected, but will fix",sst_level,leveled_manifest::MAX_LEVELS-1);
// This is really unexpected, so we'll just compact it all to fix it
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so compacting everything on behalf of {}.{}",level,overlapping_sstables,schema->ks_name(),schema->cf_name());
// Unfortunately no good limit to limit input size to max_sstables for LCS major
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so the level will be entirely compacted on behalf of {}.{}",level,overlapping_sstables,schema->ks_name(),schema->cf_name());
Some files were not shown because too many files have changed in this diff