This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to back up
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've removed the rollback section for
images, as it's not correct or relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closes scylladb/scylladb#16154
(cherry picked from commit 7ad0b92559)
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.
The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.
The easy fix is to perform the only potentially-throwing step first.
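As a minimal sketch of that ordering (the `key` and `position` types below are hypothetical stand-ins, not the real position_in_partition, and `fail_on_copy` is only a hook to simulate a throwing copy), assigning the potentially-throwing member first leaves the target untouched when the copy fails:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a clustering key whose copy assignment may throw
// (for example, because it allocates).
struct key {
    std::string bytes;
    bool fail_on_copy = false;  // hook: makes copy-assigning *from* this key throw

    key() = default;
    explicit key(std::string b, bool fail = false)
        : bytes(std::move(b)), fail_on_copy(fail) {}
    key(const key&) = default;
    key& operator=(const key& o) {
        if (o.fail_on_copy) {
            throw std::runtime_error("simulated failure while copying key");
        }
        bytes = o.bytes;
        fail_on_copy = o.fail_on_copy;
        return *this;
    }
};

// Hypothetical position type: two fields that cannot throw on assignment
// and one field that can.
struct position {
    int type = 0;
    int bound_weight = 0;
    key ck;

    position() = default;
    position(int t, int w, key k) : type(t), bound_weight(w), ck(std::move(k)) {}
    position(const position&) = default;

    position& operator=(const position& o) {
        if (this != &o) {
            ck = o.ck;                      // may throw: done first, nothing else changed yet
            type = o.type;                  // cannot throw
            bound_weight = o.bound_weight;  // cannot throw
        }
        return *this;
    }
};
```

If the key copy throws, `type` and `bound_weight` still hold their old values, so the object stays consistent.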
Fixes #15822
Closes scylladb/scylladb#15864
(cherry picked from commit 93ea3d41d8)
* tools/jmx 06f2735...ed3cc6d (1):
> Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski
* tools/java be0aaf7597...7459a11815 (1):
> Merge 'build: update several dependencies' from Piotr Grabowski
Update build dependencies which were flagged by security scanners.
Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351
Closes #16151
Currently, the API call recalculates only per-node schema version. To
work around issues like #4485, we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has an impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema
Fixes #15380
Closes #15381
(cherry picked from commit c27d212f4b)
(cherry picked from commit bfd8401477)
Currently, when said feature is enabled, we recalculate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalculation. However, this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidentally: by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and, as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.
Fixes: #16004
Closes scylladb/scylladb#16013
(cherry picked from commit 22381441b0)
(cherry picked from commit e31f2224f5)
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on a shard other than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
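The failure mode can be illustrated with a simplified, synchronous model (the signature and the `merge_result` type below are hypothetical; the real `merge_schema` is asynchronous and takes mutations). Without a default value, the recursive hop to shard 0 cannot silently drop the flag:

```cpp
#include <cassert>

// Hypothetical result type, used only to observe what the call did.
struct merge_result {
    int executed_on_shard;
    bool reload;
};

// No default value for `reload`: every caller, including the recursive
// hop to shard 0, has to state its intent explicitly.
merge_result merge_schema(int current_shard, bool reload) {
    if (current_shard != 0) {
        // Forward the flag. With a declaration like `bool reload = false`,
        // writing merge_schema(0) here would compile and lose the flag.
        return merge_schema(0, reload);
    }
    return merge_result{0, reload};
}
```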
(cherry picked from commit 48164e1d09)
(cherry picked from commit c994ed2057)
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after the schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
the schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if the request flow has a low enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
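A rough sketch of why skipping empty partitions makes the digest insensitive to expiry. Everything here is an assumption made for illustration: the `partition` struct is a simplified model, and FNV-1a merely stands in for the real digest algorithm.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified model of a schema-table partition after compaction.
struct partition {
    std::string key;
    std::vector<std::string> live_rows;  // empty if only tombstones were there
};

// FNV-1a, used here only as a stand-in for the real digest algorithm.
inline void feed(std::uint64_t& h, const std::string& s) {
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
}

std::uint64_t digest(const std::vector<partition>& partitions, bool ignore_empty) {
    std::uint64_t h = 14695981039346656037ULL;
    for (const auto& p : partitions) {
        if (ignore_empty && p.live_rows.empty()) {
            continue;  // do not feed the key of an empty partition
        }
        feed(h, p.key);
        for (const auto& r : p.live_rows) {
            feed(h, r);
        }
    }
    return h;
}
```

With `ignore_empty` set, a node that still holds the empty partition and a node where it was already compacted away compute the same value.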
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes #4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes #14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
(cherry picked from commit cf81eef370)
(cherry picked from commit 40eed1f1c5)
Currently the code will assert because the cl pointer will be null, and it
will be null because there are no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>
(cherry picked from commit 941407b905)
Backport needed by #4485.
(cherry picked from commit f233c8a9e4)
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.
To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.
Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.
I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.
Fixes: #15485
Closes scylladb/scylladb#16019
(cherry picked from commit dfd7981fa7)
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".
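For illustration only, the same check expressed over os-release text (the helper names are hypothetical and this is not the script touched by the patch; matching `ID_LIKE` word by word is an assumption of this sketch):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parse KEY=value lines in the os-release format into a map,
// stripping optional surrounding double quotes from the value.
std::map<std::string, std::string> parse_os_release(const std::string& text) {
    std::map<std::string, std::string> vars;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        auto eq = line.find('=');
        if (line.empty() || line[0] == '#' || eq == std::string::npos) {
            continue;
        }
        std::string value = line.substr(eq + 1);
        if (value.size() >= 2 && value.front() == '"' && value.back() == '"') {
            value = value.substr(1, value.size() - 2);
        }
        vars[line.substr(0, eq)] = value;
    }
    return vars;
}

// True for RHEL itself (ID is "rhel") and for RHEL-compatible systems
// (ID_LIKE contains the word "rhel").
bool is_redhat_variant(const std::map<std::string, std::string>& vars) {
    auto id = vars.find("ID");
    if (id != vars.end() && id->second == "rhel") {
        return true;
    }
    auto like = vars.find("ID_LIKE");
    if (like == vars.end()) {
        return false;
    }
    std::istringstream words(like->second);
    std::string word;
    while (words >> word) {
        if (word == "rhel") {
            return true;
        }
    }
    return false;
}
```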
Fixes #16040
Closes scylladb/scylladb#16041
(cherry picked from commit 338a9492c9)
When a base write triggers an mv write and it needs to be sent to another
shard, it used the same service group and we could end up with a
deadlock.
This fix also affects alternator's secondary indexes.
Testing was done using a not-yet-committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch, when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests), after a couple of seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting a different limit, as there wasn't any depleted
smp service group semaphore and it was also happening on non-mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844
Closes scylladb/scylladb#15845
(cherry picked from commit 020a9c931b)
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be backported
easily and without risk to older versions; this is the fix.
Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.
Fixes: scylladb/scylladb#15816
Closes scylladb/scylladb#15970
* github.com:scylladb/scylladb:
test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
api: failure_detector: fix indentation
api: failure_detector: invoke on shard 0
(cherry picked from commit 9443253f3d)
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Backport notes: relatively easy to backport, had to include
**replica: Make compaction_group responsible for deleting off-strategy compaction input**
and
**compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0**
Closes #15794
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
replica: Make compaction_group responsible for deleting off-strategy compaction input
removenode host_id must specify the host ID as a UUID,
not an ip address.
Fixes #11839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11840
(cherry picked from commit 44e1058f63)
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` wants to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystems
implement the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data loss if the application / machine /
network breaks when the data is considered persisted but it is
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped class.
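a simplified, synchronous model of the problem and the fix could look like the following. the `sink`, `upload_sink` and `checksummed_sink` types are hypothetical stand-ins for `data_sink_impl`, the jumbo upload sink and `checksummed_file_data_sink_impl`; the real interfaces are asynchronous.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal sink interface.
struct sink {
    virtual ~sink() = default;
    virtual void put(const std::string& buf) = 0;
    virtual void flush() {}  // default implementation: nothing to flush
    virtual void close() = 0;
};

// Models a multipart upload: close() finalizes the upload only if
// flush() was seen after the last write, otherwise it aborts.
struct upload_sink : sink {
    std::string pending;
    bool flushed = false;
    bool committed = false;
    bool aborted = false;
    void put(const std::string& buf) override {
        pending += buf;
        flushed = false;
    }
    void flush() override { flushed = true; }
    void close() override {
        if (flushed) {
            committed = true;
        } else {
            aborted = true;
        }
    }
};

// Wrapper that checksums data on its way to the wrapped sink.
struct checksummed_sink : sink {
    sink& out;
    unsigned checksum = 0;
    explicit checksummed_sink(sink& o) : out(o) {}
    void put(const std::string& buf) override {
        for (unsigned char c : buf) {
            checksum += c;
        }
        out.put(buf);
    }
    // The fix: delegate instead of inheriting the no-op default.
    void flush() override { out.flush(); }
    void close() override { out.close(); }
};
```

without the `flush()` override in the wrapper, the wrapped sink never sees a flush and every `close()` ends in an abort.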
Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15134
(cherry picked from commit d2d1141188)
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).
Easily fixed by removing the empty alternative. A unit test is
added.
Fixes #14705.
Closes #14707
(cherry picked from commit e00811caac)
In this branch (5.1), the most recent available rustc version is 1.60.
Despite that, the 'cargo install' command tries to install the most
recent version of a package by default, which may rely on newer rustc
versions. This patch specifies the version of the cxxbridge-cmd package
to one that works with rustc 1.60.
Closes scylladb/scylladb#15812
[avi: regenerated frozen toolchain]
Closes scylladb/scylladb#15828
Prevent div-by-zero by returning const level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to cover also offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b1e164a241)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It was not a problem until now,
when we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes #14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 42050f13a0)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting regular_compaction, which implements
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit db9ce9f35a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation to next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.
Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
sstables::offstrategy offstrategy)
So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.
Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13432
(cherry picked from commit 457c772c9c)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit 8c4b5e4 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfies the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tends to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But in the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
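The deferral can be sketched as follows. This is a simplified model, not the compaction code: the `tombstone` struct, the `count_purgeable` helper and the `max_purgeable` callback (standing in for the lookup that checks the bloom filters of non-participating sstables) are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct tombstone {
    std::int64_t timestamp;      // write timestamp
    std::int64_t deletion_time;  // seconds
};

// Returns how many tombstones can be purged. `max_purgeable` stands in
// for the expensive lookup; it is called only for tombstones that
// already passed the cheap grace-period check.
std::size_t count_purgeable(const std::vector<tombstone>& tombstones,
                            std::int64_t gc_before,
                            const std::function<std::int64_t()>& max_purgeable) {
    std::size_t purgeable = 0;
    for (const auto& t : tombstones) {
        // Cheap check first: is the tombstone past its grace period?
        if (t.deletion_time >= gc_before) {
            continue;
        }
        // Only now pay for the expensive lookup.
        if (t.timestamp < max_purgeable()) {
            ++purgeable;
        }
    }
    return purgeable;
}
```

With three tombstones of which one is past the grace period, the callback runs once instead of three times, which is the difference between the two estimations above.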
Cherry picked from commit 38b226f997
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes #14091.
Closes #13908
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15745
This is a backport of PR https://github.com/scylladb/scylladb/pull/15740.
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.0-to-5.1 upgrade guide and replace with general info.
- Remove the information from the 4.6-to-5.0 upgrade guide and replace with general info.
- Remove the information from the 5.x.y-to-5.x.z upgrade guide and replace with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
Closes #15769
* github.com:scylladb/scylladb:
doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
doc: remove wrong image upgrade info (4.6-to-5.0)
doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
(cherry picked from commit dd1207cabb)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 526d543b95)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 9852130c5b)
The estimated_partitions is estimated after the repair_meta is created.
Currently, the default estimated_partitions was used to create the
writer, which is not correct.
To fix, use the updated estimated_partitions.
Reported by Petr Gusev
Closes #14179
Fixes #15748
(cherry picked from commit 4592bbe182)
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, and the inactive read registration path.
1) service level drop invokes stop of the reader concurrency semaphore, which will
wait for in-flight requests
2) turns out it first stops the gate used for closing readers that will
become inactive.
3) proceeds to wait for in-flight reads by closing the reader permit gate.
4) one of the evictable reads takes the inactive read registration path, and
finds the gate for closing readers closed.
5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.
Fixes #15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15535
(cherry picked from commit 914cbc11cf)
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
- The query has an IF NOT EXISTS clause: as a result, no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
- The query is handled by a non-zero shard: as a result, we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
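A minimal sketch of that check, with hypothetical types: only `is_schema_change()` is named by this series, while the `result_kind` enum, the `maybe_grant_to_creator` helper and the string-based grant set are assumptions made for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical outcome kinds of handling a CREATE TABLE statement.
enum class result_kind { schema_change, void_result, bounce_to_shard };

struct result_message {
    result_kind kind;
    bool is_schema_change() const { return kind == result_kind::schema_change; }
};

// Grants creator permissions only when the statement actually created
// the table. Returns true if a grant was recorded.
bool maybe_grant_to_creator(const result_message& msg,
                            const std::string& user,
                            const std::string& table,
                            std::set<std::string>& grants) {
    if (!msg.is_schema_change()) {
        // IF NOT EXISTS on an existing table, or a bounce to another shard.
        return false;
    }
    grants.insert(user + ":" + table);
    return true;
}
```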
Additionally, a test is added that reproduces both of the mentioned
cases.
CVE-2023-33972
Fixes #15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
(cherry picked from commit ab6988c52f)
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expires 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.
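To put numbers on it (the function name and the figures are illustrative, not taken from the compaction code): a compaction that reads 100,000,000 bytes, writes 1,000,000 bytes and runs for 10 seconds.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in bytes per second over the given amount of data.
double compaction_throughput(std::uint64_t bytes, double seconds) {
    assert(seconds > 0);
    return static_cast<double>(bytes) / seconds;
}
```

Based on the output, `compaction_throughput(1000000, 10.0)` reports 100,000 bytes/s; based on the input, `compaction_throughput(100000000, 10.0)` reports 10,000,000 bytes/s, which reflects the work the compaction actually did.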
Fixes #14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14615
(cherry picked from commit 3b1829f0d8)
We allow inserting column values using a JSON value, e.g.:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```
When no JSON value is specified, the query should be rejected.
Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.
A unit test is added to prevent regressions.
Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #14785
(cherry picked from commit cbc97b41d4)
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look for the item after it was already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes #14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14496
(cherry picked from commit 599636b307)
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
Fixes #15198
Closes #15199
(cherry picked from commit 2000a09859)
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.
Move lock to the callers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13240
(cherry picked from commit da00052ad8)
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.
Closes #12540
* github.com:scylladb/scylladb:
api: gossiper: fix alive nodes
gms, service: lock live endpoint copy
gms, service: live endpoint copy method
(cherry picked from commit b919373cce)
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.
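a minimal sketch of the capping rule described above. the `capped_time` struct and the function name are hypothetical; the `was_capped` flag only marks where the more detailed log message would be emitted.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

constexpr std::int64_t int32_max = std::numeric_limits<std::int32_t>::max();

struct capped_time {
    std::int32_t value;
    bool was_capped;  // the caller can use this to log where the value came from
};

// Caps a deletion time (seconds since the epoch) that lies beyond
// INT32_MAX to INT32_MAX - 1.
capped_time cap_local_deletion_time(std::int64_t seconds_since_epoch) {
    assert(seconds_since_epoch >= 0);
    if (seconds_since_epoch > int32_max) {
        return {static_cast<std::int32_t>(int32_max - 1), true};
    }
    return {static_cast<std::int32_t>(seconds_since_epoch), false};
}
```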
Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)
Closes #15151
This mini-series backports the fix for #12010 along with low-risk patches it depends on.
Fixes: #12010
Closes #15135
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
utils: Restore indentation after previous patch
utils: Coroutinize verify_owner_and_mode()
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes #12010
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
There's a helper verification_error() that prints a warning and returns
an exceptional future. This one is converted into a void, throwing one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow an exception in perform_task for reshape compaction.
Fixes: #15058.
(cherry picked from commit e0ce711e4f)
Closes #15123
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.
Fixes #14963
Closes #14964
(cherry picked from commit e13a2b687d)
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.
When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.
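A minimal sketch of this skip-and-continue behaviour, assuming a stand-in exception type and a caller-supplied per-table repair function (`repair_tables` and `repair_one` are hypothetical names, and the real code logs instead of collecting names):

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the exception thrown when a table no longer exists.
struct no_such_column_family : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Repairs each table in turn. A table dropped in the meantime is recorded
// and skipped; any other exception still propagates to the caller.
// Returns the names of the skipped tables.
inline std::vector<std::string> repair_tables(
        const std::vector<std::string>& tables,
        const std::function<void(const std::string&)>& repair_one) {
    assert(static_cast<bool>(repair_one));
    std::vector<std::string> skipped;
    for (const auto& table : tables) {
        try {
            repair_one(table);
        } catch (const no_such_column_family&) {
            skipped.push_back(table); // swallowed: continue with the next table
        }
    }
    return skipped;
}
```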
Fixes: scylladb/scylladb#13045
Closes scylladb/scylladb#13068
* github.com:scylladb/scylladb:
repair: fix indentation
repair: continue user requested repair if no_such_column_family is thrown
repair: add find_column_family_if_exists function
(cherry picked from commit 9859bae54f)
Will be useful for writing tests which trigger failures, and for
workarounds in production.
(cherry picked from commit 5c8ad2db3c)
Refs scylladb/scylladb#12969
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742).
As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.
Fixes #14501
Fixes #14742
Closes #14745
* github.com:scylladb/scylladb:
test/cql-pytest: test confirming that casting to counter doesn't work
cql: support casting of counter to other types
cql: implement missing counterasblob() and blobascounter() functions
cql: implement missing type functions for "counters" type
(cherry picked from commit a637ddd09c)
Small modification was needed to validate_visitor API for the patch to
apply.
This patch includes a translation of two more test files from
Cassandra's CQL unit test directory cql3/validation/operations.
All tests included here pass on Cassandra. Several tests fail on Scylla
and are marked "xfail". These failures discovered two previously-unknown
bugs:
#12243: Setting USING TTL of "null" should be allowed
#12247: Better error reporting for oversized keys during INSERT
And also added reproducers for two previously-known bugs:
#3882: Support "ALTER TABLE DROP COMPACT STORAGE"
#6447: TTL unexpected behavior when setting to 0 on a table with
default_time_to_live
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12248
(cherry picked from commit 0c26032e70)
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.
This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.
All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:
Already known issues:
Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k
New issues:
Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12816
(cherry picked from commit 328cdb2124)
(cherry picked from commit e11561ef65)
Modified for 5.1 to comment out error-path tests for "unset" values that
are silently ignored (instead of being detected) in this version.
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14528
(cherry picked from commit f08bc83cb2)
(cherry picked from commit e03c21a83b)
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:
1. A new test file, test_cast_data.py, for testing data casts (a
CAST (...) as ... in a SELECT), starting with testing casts from
varint to other types.
The test uncovers a lot of interesting cases (it is heavily
commented to explain these cases) but nothing there is wrong
and all tests pass on Scylla.
2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
that this caused #13551. In Cassandra and older Scylla, the sum
returned a NaN. In Scylla today, it generates a misleading
error message.
As usual, the tests were run on both Cassandra (4.1.1) and Scylla.
Refs #13551.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 78555ba7f1)
(cherry picked from commit 79b5befe65)
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.
Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and performs it in all `lookup_*_querier` methods.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
This patch doesn't solve the problem of semaphores mismatched because of changes in service levels/scheduling groups; it only mitigates it.
Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770
Closes #14736
* github.com:scylladb/scylladb:
querier_cache: add stats of scheduling group mismatches
querier_cache: check semaphore mismatch during querier lookup
querier_cache: add reference to `replica::database::is_user_semaphore()`
replica:database: add method to determine if semaphore is user one
(cherry picked from commit a8feb7428d)
This mini-series introduces dht::tokens_filter and uses it for consuming staging sstable in the view_update_generator.
The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges.
Refs #9559
Closes #11932
* github.com:scylladb/scylladb:
db: view_update_generator: always clean up staging sstables
compaction: extract incremental_owned_ranges_checker out to dht
(cherry picked from commit 3aff59f189)
do_refresh_state() keeps iterators to rows_entry in a vector.
This vector might be resized during the procedure, triggering
memory reclaim and invalidating the iterators, which can cause
arbitrarily long loops and/or a segmentation fault during make_heap().
To fix this, do_refresh_state has to always be called from the allocating
section.
Additionally, it turns out that the first do_refresh_state is useless,
because reset_state() doesn't set _change_mark. This causes do_refresh_state
to be needlessly repeated during a next_row() or next_range_tombstone() which
happens immediately after it. Therefore this patch moves the _change_mark
assignment from maybe_refresh_state to do_refresh_state, so that the change mark
is properly set even after the first refresh.
Fixes #14696
Closes #14697
(cherry picked from commit 41aef6dc96)
before this change, there are chances that the temporary sstables
created for collecting the GC-able data created by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.
in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have a good chance of
overlapping the latter, these GC-only SSTables are assigned
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.
so, in this change,
* to prevent sstables in the transient stage from being picked
up by regular compactions, a new interface class is introduced
so that the sstable is always added to registration before
it is added to sstable set, and removed from registration after
it is removed from sstable set. the struct helps to consolidate
the registration related logic in a single place, and helps to
make it more obvious that the timespan of an sstable in
the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
overlap with the output sstable run. the run_id for the
gc sstable run is created only when the gc sstable writer
is created, because gc sstables are not always created
for all compactions.
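the ordering rule in the first bullet (registered before being added to the set, deregistered after being removed) can be illustrated with a small RAII guard. this is a swapped-in technique for illustration only, not the interface class added by this change; `registry`, `sstable_set`, `registered_sstable` and `eligible_for_compaction` are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

struct registry { std::set<std::string> compacting; };   // sstables owned by a running compaction
struct sstable_set { std::set<std::string> members; };   // sstables visible to the table

// Keeps the registration lifetime strictly around the set membership lifetime.
class registered_sstable {
    registry& _reg;
    sstable_set& _set;
    std::string _name;
public:
    registered_sstable(registry& reg, sstable_set& set, std::string name)
            : _reg(reg), _set(set), _name(std::move(name)) {
        assert(!_name.empty());
        _reg.compacting.insert(_name);  // 1. register first
        _set.members.insert(_name);     // 2. only then make it visible
    }
    ~registered_sstable() {
        _set.members.erase(_name);      // 1. hide it first
        _reg.compacting.erase(_name);   // 2. only then deregister
    }
    registered_sstable(const registered_sstable&) = delete;
    registered_sstable& operator=(const registered_sstable&) = delete;
};

// A regular compaction may only pick sstables that are visible and unregistered.
inline bool eligible_for_compaction(const registry& reg, const sstable_set& set,
                                    const std::string& name) {
    return set.members.count(name) != 0 && reg.compacting.count(name) == 0;
}
```

with this ordering there is no point in time at which the sstable is in the set but missing from the registration.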
please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` pass a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.
Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14725
(cherry picked from commit fdf61d2f7c)
Closes #14828
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size, containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
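A simplified model of the accounting, for illustration only. `buffering_consumer`, the fixed `empty_mutation_size` of 64 bytes and the flush-on-every-consume policy are assumptions of this sketch; the real consumer uses the size reported by the `mutation` object.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a consumer that buffers mutations
// and flushes them once the accounted size reaches a limit.
class buffering_consumer {
    std::size_t _buffer_size = 0;
    std::size_t _buffered_partitions = 0;
    std::size_t _flushes = 0;
    const std::size_t _limit;

    void maybe_flush() {
        if (_buffer_size >= _limit) {
            ++_flushes;
            _buffer_size = 0;
            _buffered_partitions = 0;
        }
    }
public:
    // Assumed memory footprint of a freshly created, empty mutation object.
    static constexpr std::size_t empty_mutation_size = 64;

    explicit buffering_consumer(std::size_t limit) : _limit(limit) { assert(limit > 0); }

    void consume_new_partition() {
        ++_buffered_partitions;
        _buffer_size += empty_mutation_size; // the fix: count the mutation object itself
        maybe_flush();
    }
    void consume_row(std::size_t row_size) {
        _buffer_size += row_size;
        maybe_flush();
    }
    std::size_t flushes() const { return _flushes; }
    std::size_t buffered_partitions() const { return _buffered_partitions; }
};
```

Without the line marked as the fix, a stream of empty partitions would never trigger a flush in this model, which mirrors the unbounded growth described above.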
Fixes: #14819
Closes #14821
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
(cherry picked from commit 056d04954c)
(cherry picked from commit e34c62c567)
It was found that cached_file dtor can hit the following assert
after OOM
cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.
cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.
That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.
share() is responsible for automatically placing the page into
LRU once refcount drops to zero.
If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.
Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.
To fix this, a page brought in by read ahead is now linked into LRU,
so it doesn't sit in memory indefinitely. This also allows the
cached_file dtor to clear the whole cache even if some of the pages
brought in advance are never fetched later.
A reproducer was added.
Fixes #14814.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14818
(cherry picked from commit 050ce9ef1d)
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was fed into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.
Fixes #14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14813
(cherry picked from commit 0ac43ea877)
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after a reboot.
This is incorrect behavior; we should start it during the setup.
Also, unmask is unnecessary for enabling the timer.
Fixes #14249
Closes #14252
(cherry picked from commit c70a9cbffe)
Closes #14420
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes #14676
Closes #14677
(cherry picked from commit 1b577e0414)
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.
Refs #7926
Fixes #13689
Closes #12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
(cherry picked from commit d2e089777b)
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
Fixes: #13841
Fixes: #12552
Closes #13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
(cherry picked from commit a7c2c9f92b)
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one of the upserts, as all cells in the respective upsert will carry the same expiration time.
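The new ordering for two live cells can be sketched as below. This is an illustration, not the actual `compare_atomic_cell_for_merge`: `live_cell` and `compare_for_merge` are hypothetical, and the rule that an expiring cell wins over a non-expiring one with the same timestamp is an assumption of this sketch that the text above does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical model of a live cell.
struct live_cell {
    int64_t timestamp;
    std::optional<int64_t> expiry; // absolute expiration time; nullopt if not expiring
    int64_t ttl;                   // only meaningful when expiry is set
    std::string value;
};

// Returns a negative number if `l` loses, a positive number if `l` wins, 0 on a tie.
inline int compare_for_merge(const live_cell& l, const live_cell& r) {
    assert(!l.expiry || l.ttl > 0);
    assert(!r.expiry || r.ttl > 0);
    if (l.timestamp != r.timestamp) {
        return l.timestamp < r.timestamp ? -1 : 1;
    }
    if (l.expiry.has_value() != r.expiry.has_value()) {
        // Assumption of this sketch: the expiring cell wins.
        return l.expiry.has_value() ? 1 : -1;
    }
    if (l.expiry) {
        if (*l.expiry != *r.expiry) {
            return *l.expiry < *r.expiry ? -1 : 1; // the cell that expires later wins
        }
        if (l.ttl != r.ttl) {
            return l.ttl > r.ttl ? -1 : 1; // same expiry: smaller ttl means written later
        }
    }
    return l.value.compare(r.value); // the value is compared last
}
```

Because all cells of one upsert share the same expiration time, this ordering picks every cell from the same upsert whenever the expiration times differ.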
Fixes scylladb/scylladb#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
Closes scylladb/scylladb#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
(cherry picked from commit 87b4606cd6)
Closes #14651
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Backported from c25201c1a3. Some minor API-related adjustments were needed.
Closes #14621
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
Fixes #11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes #14631
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes was modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
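The shape of the change, recursion replaced by a loop around a step function that may return std::nullopt, can be sketched in plain C++ without Seastar. The `merger` class and its int "fragments" are hypothetical; the real code drives the loop with seastar::repeat_until_value and futures.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical merger over readers that each yield a (possibly empty) batch.
class merger {
    std::vector<std::vector<int>> _readers;
    std::size_t _next = 0;

    // One step. Returns a batch, or std::nullopt when the current reader
    // produced nothing and the caller should try again. Never calls itself.
    std::optional<std::vector<int>> maybe_produce_batch() {
        assert(_next <= _readers.size());
        if (_next == _readers.size()) {
            return std::vector<int>{}; // end of stream: empty batch
        }
        auto& reader = _readers[_next++];
        if (reader.empty()) {
            return std::nullopt;
        }
        return std::move(reader);
    }
public:
    explicit merger(std::vector<std::vector<int>> readers) : _readers(std::move(readers)) {}

    // Loops instead of recursing, so stack depth stays constant no matter
    // how many readers report end of stream without producing fragments.
    std::vector<int> operator()() {
        for (;;) {
            if (auto batch = maybe_produce_batch()) {
                return std::move(*batch);
            }
        }
    }
};
```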
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes #14452
(cherry picked from commit ee9bfb583c)
Closes #14604
The discussion on the thread says, when we reformat a volume with another
filesystem, kernel and libblkid may skip populating /dev/disk/by-* since they
detected two filesystem signatures, because mkfs.xxx did not clear the previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see previous filesystem signature on /dev/mdX when we construct RAID volume multiple times on same disks).
Also dropped -f option from mkfs.xfs, it will check wipefs is working as we
expected.
Fixes #13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #13738
(cherry picked from commit fdceda20cc)
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes #14503
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes #14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
(cherry picked from commit 8a7261fd70)
Prior to off-strategy compaction, streaming / repair would place
staging files into main sstable set, and wait for view building
completion before they could be selected for regular compaction.
The reason for that is that view building relies on table providing
a mutation source without data in staging files. Had regular compaction
mixed staging data with non-staging one, table would have a hard time
providing the required mutation source.
After off-strategy compaction, staging files can be compacted
in parallel to view building. If off-strategy completes first, it
will place the output into the main sstable set. So a parallel view
building (on sstables used for off-strategy) may potentially get a
mutation source containing staging data from the off-strategy output.
That will mislead view builder as it won't be able to detect
changes to data in main directory.
To fix it, we'll do what we did before. Filter out staging files
from compaction, and trigger the operation only after we're done
with view building. We're piggybacking on off-strategy timer for
still allowing the off-strategy to only run at the end of the
node operation, to reduce the amount of compaction rounds on
the data introduced by repair / streaming.
Fixes #11882.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11919
(cherry picked from commit a57724e711)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14365
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
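a minimal sketch of the idea, reserving once before formatting a long list. `format_names` and its reallocation counter are hypothetical and exist only to make the effect observable; this is not the actual logging code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct formatted {
    std::string text;
    std::size_t reallocations; // number of times the output buffer changed capacity
};

// Formats `names` as "[a, b, c]". With `reserve_first`, the output buffer is
// sized up front, so appending never has to grow it.
inline formatted format_names(const std::vector<std::string>& names, bool reserve_first) {
    std::string out;
    if (reserve_first) {
        std::size_t total = 2; // the two brackets
        for (const auto& n : names) {
            total += n.size() + 2; // name plus ", " (slight over-estimate)
        }
        out.reserve(total);
    }
    std::size_t reallocations = 0;
    auto capacity = out.capacity();
    auto append = [&](const std::string& s) {
        out += s;
        if (out.capacity() != capacity) {
            ++reallocations;
            capacity = out.capacity();
        }
    };
    append("[");
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (i != 0) {
            append(", ");
        }
        append(names[i]);
    }
    append("]");
    assert(!reserve_first || reallocations == 0);
    return {std::move(out), reallocations};
}
```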
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13907
(cherry picked from commit 5544d12f18)
Fixes scylladb/scylladb#14071
Information was duplicated before and the version on this page was outdated - RBNO is enabled for replace operation already.
Closes #12984
(cherry picked from commit bd7caefccf)
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
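As a rough illustration of the predicate approach, here is a minimal sketch. The names `sstable_stub`, `select_for_read` and `not_staging` are hypothetical and are not the actual sstable set interface; the point is only that the reader filters the existing set instead of building a fresh one.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an sstable; only the fields needed here.
struct sstable_stub {
    std::string name;
    bool staging;  // true while the sstable still awaits view building
};

using sstable_predicate = std::function<bool(const sstable_stub&)>;

// Select the sstables a reader should use by skipping entries the predicate
// rejects, rather than inserting the accepted ones into a new indexed set.
std::vector<const sstable_stub*> select_for_read(
        const std::vector<sstable_stub>& set, const sstable_predicate& pred) {
    assert(pred);
    std::vector<const sstable_stub*> out;
    for (const auto& sst : set) {
        if (pred(sst)) {
            out.push_back(&sst);
        }
    }
    return out;
}

// Predicate that excludes staging sstables.
inline bool not_staging(const sstable_stub& sst) { return !sst.staging; }
```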
With this improvement, view building was measured to be 3x faster.
from
INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s
to
INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s
Refs #14089.
Fixes #14244.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14476
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes #13491
Closes #14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
(cherry picked from commit 586102b42e)
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes #12462
Closes #13906
(cherry picked from commit 9b0679c140)
This includes seastar update titled
'Merge 'Split rpc::server stop into two parts''
Includes backport of #12244 fix
* br-5.1-backport-ms-shutdown:
messaging_service: Shutdown rpc server on shutdown
messaging_service: Generalize stop_servers()
messaging_service: Restore indentation after previous patch
messaging_service: Coroutinize stop()
messaging_service: Coroutinize stop_servers()
messaging: Shutdown on stop() if it wasn't shut down earlier
Update seastar submodule
refs: #14031
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process
backport: The messaging_service::shutdown() had conflict due to missing
e147681d85 commit
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it do_with_servers() and make it accept method to call and message
to print. This gives the ability to reuse this helper in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
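The idea can be sketched as follows. `cleanup_job` and `on_exhausted` are hypothetical names used only for illustration, and strings stand in for sstables; the real compaction code is structured differently.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a cleanup job's input set.
struct cleanup_job {
    std::vector<std::string> remaining;

    // Called as incremental compaction exhausts input sstables: drop them
    // from the job so it no longer holds them and a retry skips them.
    void on_exhausted(const std::vector<std::string>& exhausted) {
        auto is_exhausted = [&](const std::string& s) {
            return std::find(exhausted.begin(), exhausted.end(), s) != exhausted.end();
        };
        remaining.erase(std::remove_if(remaining.begin(), remaining.end(), is_exhausted),
                        remaining.end());
    }
};
```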
New unit test reproduces the failure, and passes with the fix.
Fixes #14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14038
(cherry picked from commit 23443e0574)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14195
With regard to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784
Closes #13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.
Fixes #14139
Closes #14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes #13545.
Closes #14134
(cherry picked from commit f51312e580)
Fixes a regression introduced in 80917a1054:
"scylla_prepare: stop generating 'mode' value in perftune.yaml"
When cpuset.conf contains a "full" CPU set the negation of it from
the "full" CPU set is going to generate a zero mask as an irq_cpu_mask.
This is an illegal value that will eventually end up in the generated
perftune.yaml, which in turn will make the scylla service fail to start
until the issue is resolved.
In such a case an irq_cpu_mask must represent a "full" CPU set mimicking
a former 'MQ' mode.
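A minimal sketch of the mask arithmetic described above, assuming CPU sets fit into a 64-bit mask. The function name `compute_irq_cpu_mask` is hypothetical; the real logic lives in the Python setup scripts, not in C++.

```cpp
#include <cassert>
#include <cstdint>

// CPU sets as 64-bit masks (bit i set == CPU i is in the set).
// IRQ CPUs are all CPUs not listed in cpuset.conf; if that leaves nothing,
// fall back to the full set, mimicking the former 'MQ' mode.
std::uint64_t compute_irq_cpu_mask(std::uint64_t all_cpus, std::uint64_t cpuset) {
    assert(all_cpus != 0);
    assert((cpuset & ~all_cpus) == 0);  // cpuset must fit into the 'all' set
    std::uint64_t irq = all_cpus & ~cpuset;
    return irq != 0 ? irq : all_cpus;
}
```

For example, with 4 CPUs (`0xF`) and a cpuset of `0xE` the IRQ mask is `0x1`, while a "full" cpuset of `0xF` yields `0xF` rather than the illegal zero mask.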
\Fixes scylladb/scylladb#11701
Tested:
- Manually on a 2 vCPU VM in an 'auto-selection' mode.
- Manually on a large VM (48 vCPUs) with an 'MQ' manually
enforced.
Message-Id: <20221004004237.2961246-1-vladz@scylladb.com>
(cherry picked from commit 8195dab92a)
This patch fixes the regression introduced by 3a51e78 which broke
a very important contract: perftune.yaml should not be "touched"
by Scylla scriptology unless explicitly requested.
And a call for scylla_cpuset_setup is such an explicit request.
The issue that the offending patch was intending to fix was that
cpuset.conf was always generated anew for every call of
scylla_cpuset_setup - even if a resulting cpuset.conf would come
out exactly the same as the one present on the disk before the call.
And since the original code was following the contract mentioned above
it was also deleting perftune.yaml every time too.
However, this was just an unavoidable side-effect of that cpuset.conf
re-generation.
The above also means that if scylla_cpuset_setup doesn't write to cpuset.conf
we should not "touch" perftune.yaml and vice versa.
This patch implements exactly that together with reverting the dangerous
logic introduced by 3a51e78.
\Fixes scylladb/scylladb#11385
\Fixes scylladb/scylladb#10121
(cherry picked from commit c538cc2372)
Modern perftune.py supports a more generic way of defining IRQ CPUs:
'irq_cpu_mask'.
This patch makes our auto-generation code create a perftune.yaml
that uses this new parameter instead of using outdated 'mode'.
As a side effect, this change eliminates the notion of "incorrect"
value in cpuset.conf - every value is valid now as long as it fits into
the 'all' CPU set of the specific machine.
Auto-generated 'irq_cpu_mask' is going to include all bits from 'all'
CPU mask except those defined in cpuset.conf.
\Fixes scylladb/scylladb#9903
(cherry picked from commit 80917a1054)
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations
For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch makes said class cheap to copy, by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).
Fixes: #11669
Closes #11670
(cherry picked from commit 169a8a66f2)
The manager intended to periodically reevaluate compaction need for
each registered table. But it's not working as intended.
The reevaluation is one-off.
This means that compaction was not kicking in later for a table with
little to no write activity that had data expiring 1 hour from now.
Also make sure that reevaluation happens within the compaction
scheduling group.
Fixes #13430.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 156ac0a67a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Every 1 hour, compaction manager will submit all registered table_state
for a regular compaction attempt, all without yielding.
This can potentially cause a reactor stall if there are 1000s of table
states, as compaction strategy heuristics will run on behalf of each,
and processing all buckets and picking the best one is not cheap.
This problem can be magnified with compaction groups, as each group
is represented by a table state.
This might appear in dashboard as periodic stalls, every 1h, misleading
the investigator into believing that the problem is caused by a
chronological job.
This is fixed by piggybacking on compaction reevaluation loop which
can yield between each submission attempt if needed.
Fixes #12390.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12391
(cherry picked from commit 67ebd70e6e)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
postponed_compactions_reevaluation() runs until compaction_manager is
stopped, checking if it needs to launch new compactions.
Make it return a future instead of stashing its completion somewhere.
This makes it easier to convert it to a coroutine.
(cherry picked from commit d2c44cba77)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When off-strategy compaction completes, regular compaction is not triggered.
If off-strategy output causes the table's SSTable set to not conform to the strategy
goal, it means that read and space amplification will be suboptimal until the next
compaction kicks in, which can take an indefinite amount of time (e.g. when active
memtable is flushed).
Let's reevaluate compaction on main SSTable set when off-strategy ends.
Fixes #13429.
Backport note: conflict is around compaction_group vs table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2652b41606)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);
Fixes #13394.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 213eaab246)
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.
static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
Fixes #13395.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),
Fixes #13396.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit f8df3c72d4)
Variant used by
streaming/stream_transfer_task.cc: , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))
as full slice is retrieved after schema is moved (clang evaluates
left-to-right), the stream transfer task can be potentially working
on a stale slice for a particular set of partitions.
static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());
Fixes #13397.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 04932a66d3)
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
Setting gc_before to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.
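The difference can be illustrated with a simplified sketch. The functions `get_gc_before` and `can_drop` below are hypothetical reductions and `std::chrono::system_clock` stands in for Scylla's gc_clock; only the comparison against gc_before is modelled.

```cpp
#include <cassert>
#include <chrono>

using gc_clock = std::chrono::system_clock;  // stand-in for Scylla's gc_clock

enum class tombstone_gc_mode { timeout, immediate };

// Hypothetical simplified version of the gc_before calculation.
gc_clock::time_point get_gc_before(tombstone_gc_mode mode,
                                   gc_clock::time_point query_time,
                                   std::chrono::seconds gc_grace_seconds) {
    switch (mode) {
    case tombstone_gc_mode::timeout:
        return query_time - gc_grace_seconds;
    case tombstone_gc_mode::immediate:
        return query_time;  // same as timeout mode with gc_grace_seconds == 0
    }
    return query_time;
}

// In this model, data with a TTL may only be dropped once its expiry
// precedes gc_before.
bool can_drop(gc_clock::time_point expiry, gc_clock::time_point gc_before) {
    return expiry < gc_before;
}
```

With gc_before equal to query_time, a cell expiring 1000000 seconds from now is kept; with gc_before equal to `time_point::max()` the same check would drop it.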
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes #13572
Closes #13800
(cherry picked from commit 7fcc403122)
This is not really an error, so print it in debug log_level
rather than error log_level.
Fixes #13374
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13462
(cherry picked from commit cc42f00232)
Courtesy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});
The use-after-move is UB; whether it actually happens depends on evaluation order.
We haven't hit it yet as clang is left-to-right.
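The general shape of the fix for this class of bug can be sketched as below. The `entry` struct and `insert_entry` are hypothetical and much simpler than the row cache code; the point is to read what is needed from the object in a separate statement before moving it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

struct entry {
    std::string key;
    std::string payload;
};

// Risky shape (shown as a comment only): the key expression and the moved
// object are arguments of the same call, and the order in which function
// arguments are evaluated is not specified:
//     insert_kv(m, e.key, std::move(e));
//
// Safer shape: extract the key first, then move the object.
void insert_entry(std::map<std::string, entry>& m, entry e) {
    std::string key = e.key;  // copied before e is moved from
    m.emplace(std::move(key), std::move(e));
}
```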
Fixes #13400.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13401
(cherry picked from commit d2d151ae5b)
Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used.
Fixes: #12939
Closes #12963
* github.com:scylladb/scylladb:
test:boost: counter column parallelized aggregation test
service:forward_service: use long type when column is counter
(cherry picked from commit 61e67b865a)
Run tests for parallelized aggregation with
`enable_parallelized_aggregation` set always to true, so the tests work
even if the default value of the option is false.
Closes #12409
(cherry picked from commit 83bb77b8bb)
Ref #12939.
This patch fixes #12475, where an aggregation (e.g., COUNT(*), MIN(v))
of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()")
resulted in an internal error instead of the "zero" result that each
aggregator expects (e.g., 0 for COUNT, null for MIN).
The problem is that normally our aggregator forwarder picks the nodes
which hold the relevant partition(s), forwards the request to each of
them, and then combines these results. When there are no partitions,
the query is sent to no node, and we end up with an empty result set
instead of the "zero" results. So in this patch we recognize this
case and build those "zero" results (as mentioned above, these aren't
always 0 and depend on the aggregation function!).
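To illustrate why the "zero" result depends on the aggregation function, here is a small sketch. `merge_count` and `merge_min` are hypothetical helpers, not the forward_service code; an empty input models the case where the query was sent to no node.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Combine per-node partial COUNT results.
std::int64_t merge_count(const std::vector<std::int64_t>& partials) {
    std::int64_t total = 0;  // COUNT's "zero" result is 0
    for (auto p : partials) {
        total += p;
    }
    return total;
}

// Combine per-node partial MIN results; a disengaged optional models null.
std::optional<int> merge_min(const std::vector<std::optional<int>>& partials) {
    std::optional<int> best;  // MIN's "zero" result is null
    for (const auto& p : partials) {
        if (p && (!best || *p < *best)) {
            best = p;
        }
    }
    return best;
}
```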
The patch also adds two tests reproducing this issue in a fairly general
way (e.g., several aggregators, different aggregation functions) and
confirming the patch fixes the bug.
The test also includes two additional tests for COUNT aggregation, which
uncovered an incompatibility with Cassandra which is still not fixed -
so these tests are marked "xfail":
Refs #12477: Combining COUNT with GROUP BY results in an empty result
in Cassandra, and one result with empty count in Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12715
(cherry picked from commit 3ba011c2be)
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward change has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491
Closes #13563
(cherry picked from commit 72003dc35c)
Undefined behavior because the evaluation order is undefined.
With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().
The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for row, which accesses schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.
Fixes #13093.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13092
(cherry picked from commit 3fae46203d)
Fixes https://github.com/scylladb/scylladb/issues/13106
This commit removes the information that BYPASS CACHE
is an Enterprise-only feature and replaces that info
with the link to the BYPASS CACHE description.
Closes #13316
(cherry picked from commit 1cfea1f13c)
* tools/python3 bf6e892...4b04b46 (1):
> dist: redhat: provide only a single version
s/%{version}/%{version}-%{release}/ in `Requires:` sections.
this enforces the runtime dependencies of exactly the same
releases between scylla packages.
Fixes #13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 7165551fd7)
The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping
was flaky. It tests the toppartition request, which unfortunately needs
to choose a sampling duration in advance, and we chose 1 second which we
considered more than enough - and indeed typically even 1ms is enough!
but very rarely (we know of only one occurrence, in issue #13223) one
second is not enough.
Instead of increasing this 1 second and making this test even slower,
this patch takes a retry approach: The test starts with a 0.01 second
duration, and is then retried with increasing durations until it succeeds
or a 5-seconds duration is reached. This retry approach has two benefits:
1. It de-flakes the test (allowing a very slow test to take 5 seconds
instead of 1 second which wasn't enough), and 2. At the same time it
makes a successful test much faster (it used to always take a full
second, now it takes 0.07 seconds on a dev build on my laptop).
A *failed* test may, in some cases, take 10 seconds after this patch
(although in some other cases, an error will be caught immediately),
but I consider this acceptable - this test should pass, after all,
and a failure indicates a regression and taking 10 seconds will be
the last of our worries in that case.
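The retry approach can be sketched as follows. The actual test is written in Python; this is only an illustration of the loop, `retry_with_growing_duration` is a hypothetical helper, and the doubling factor is an assumption since the description above only says the durations increase.

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Try a short sampling duration first and grow it until the attempt
// succeeds or the cap is reached. Returns the duration that succeeded,
// or nullopt if even the capped duration failed.
std::optional<double> retry_with_growing_duration(
        const std::function<bool(double)>& attempt,
        double first_seconds = 0.01, double max_seconds = 5.0) {
    assert(first_seconds > 0 && max_seconds >= first_seconds);
    for (double d = first_seconds; ; d *= 2) {
        if (d > max_seconds) {
            d = max_seconds;  // the last attempt uses the cap
        }
        if (attempt(d)) {
            return d;
        }
        if (d >= max_seconds) {
            return std::nullopt;
        }
    }
}
```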
Fixes #13223.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13238
(cherry picked from commit c550e681d7)
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.
The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.
Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.
Fixes #13239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13291
(cherry picked from commit 4fdcee8415)
sleep_abortable() is aborted on success, which causes sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if it happens often.
Also, it spams the logs if --logger-log-level exception=trace is enabled.
Avoid by swallowing the exception on cancellation.
Fixes #13278.
Closes #13279
(cherry picked from commit 99cb948eac)
Otherwise the null pointer is dereferenced.
Add a unit test reproducing the issue
and testing this fix.
Fixes #13636
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12877ad026)
The removenode_abort logic that follows the warning
may throw, in which case information about
the original exception was lost.
Fixes: #11722
Closes #11735
(cherry picked from commit 40bd9137f8)
Related: https://github.com/scylladb/scylla-enterprise/issues/2807
This commit removes the --load-and-stream nodetool option
from version 5.1 - it is not supported in this version.
This commit should only be merged to branch-5.1 (not to master)
as the feature will be added in the later versions => in versions
prior to 5.2.x the information about the option is a bug.
Closes #13618
In `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry corresponding to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
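The effect of the resolution mismatch can be shown with a small sketch using plain `std::chrono` durations as timestamps. The helper names are hypothetical; the real code works with timeuuids and range tombstones.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::microseconds;
using std::chrono::milliseconds;

// Upper bound of a tombstone meant to delete every entry strictly older
// than `new_entry`, computed at millisecond resolution (the buggy variant).
microseconds bound_ms_resolution(microseconds new_entry) {
    // Truncating to milliseconds loses the sub-millisecond part.
    return std::chrono::duration_cast<milliseconds>(new_entry);
}

// The same bound at microsecond resolution (the fixed variant).
microseconds bound_us_resolution(microseconds new_entry) {
    return new_entry;
}

bool deleted(microseconds old_entry, microseconds bound) {
    return old_entry < bound;
}
```

Two state IDs at 1000100 us and 1000900 us fall into the same millisecond: the millisecond-resolution bound is 1000000 us and misses the older entry, while the microsecond-resolution bound covers it.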
Fixes #13594
Closes #13604
* github.com:scylladb/scylladb:
db: system_keyspace: use microsecond resolution for group0_history range tombstone
utils: UUID_gen: accept decimicroseconds in min_time_UUID
(cherry picked from commit 10c1f1dc80)
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789
Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793
Closes #11795
* github.com:scylladb/scylladb:
distributed_loader: populate_column_family: reindent
distributed_loader: coroutinize populate_column_family
distributed_loader: table_population_metadata: start: reindent
distributed_loader: table_population_metadata: coroutinize start_subdir
distributed_loader: table_population_metadata: start_subdir: reindent
distributed_loader: pre-load all sstables metadata for table before populating it
(cherry picked from commit 4aa0b16852)
Our documentation states that writing an item with "USING TTL 0" means it
should never expire. This should be true even if the table has a default
TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no
USING TTL at all (i.e., it took the default TTL, instead of unlimited).
We had two xfailing tests demonstrating that Scylla's behavior in this
is different from Cassandra. Scylla's behavior in this case was also
undocumented.
By the way, Cassandra used to have the same bug (CASSANDRA-11207) but
it was fixed already in 2016 (Cassandra 3.6).
So in this patch we fix Scylla's "USING TTL 0" behavior to match the
documentation and Cassandra's behavior since 2016. One xfailing test
starts to pass and the second test passes this bug and fails on a
different one. This patch also adds a third test for "USING TTL ?"
with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a
missing "USING TTL".
The origin of this bug was that after parsing the statement, we saved
the USING TTL in an integer, and used 0 for the case of no USING TTL
given. This meant that we couldn't tell if we have USING TTL 0 or
no USING TTL at all. This patch uses an std::optional so we can tell
the case of a missing USING TTL from the case of USING TTL 0.
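A minimal sketch of the distinction, assuming a result of 0 means "never expires". `effective_ttl` is a hypothetical helper, not the actual statement code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// `using_ttl` is disengaged when the statement has no USING TTL clause,
// in which case the table's default TTL applies. An explicit USING TTL 0
// stays 0 (unlimited) instead of being mistaken for "not given".
std::int32_t effective_ttl(std::optional<std::int32_t> using_ttl,
                           std::int32_t table_default_ttl) {
    return using_ttl.value_or(table_default_ttl);
}
```

With a plain integer and 0 as the "not given" marker, the second and third cases below would be indistinguishable: `effective_ttl(60, 3600)` is 60, `effective_ttl(0, 3600)` is 0, and `effective_ttl(std::nullopt, 3600)` is 3600.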
Fixes #6447
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13079
(cherry picked from commit a4a318f394)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retrieving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partition is interrupted
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engaged optional from `detach_state()`, because there is no state to
detach, the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, by overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_iteration::no` and `detach_state()` will return a disengaged
optional as it should in this case.
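The bookkeeping can be sketched with a heavily reduced model. `mini_compactor` is hypothetical and keeps only the `_stop` flag; the real compactor has far more state and a different interface.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class stop_iteration { no, yes };

// Hypothetical, heavily reduced compactor: only the `_stop` bookkeeping.
class mini_compactor {
    stop_iteration _stop = stop_iteration::no;
    std::string _partition_state = "state";
public:
    // A fragment consumer asked to stop in the middle of the partition.
    void consume_fragment(stop_iteration downstream) {
        if (downstream == stop_iteration::yes) {
            _stop = stop_iteration::yes;
        }
    }
    // At partition end, whatever the downstream consumer returns wins.
    void consume_partition_end(stop_iteration downstream) {
        _stop = downstream;
    }
    // Engaged only when the partition was interrupted and may be resumed.
    std::optional<std::string> detach_state() const {
        if (_stop == stop_iteration::yes) {
            return _partition_state;
        }
        return std::nullopt;
    }
};
```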
Fixes: #12629
Closes #13365
(cherry picked from commit bae62f899d)
The patch doesn't apply cleanly, so a targeted backport PR was necessary.
I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change.
Fixes: https://github.com/scylladb/scylladb/issues/11803
Closes #13516
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: don't evict inactive readers needlessly
reader_concurrency_semaphore: add stats to record reason for queueing permits
reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
reader_concurrency_semaphore: add set_resources()
The total disk space used metric incorrectly reports the amount of
disk space ever used. It should report the size of all sstables
being used plus the ones waiting to be deleted.
The live disk space used metric, by this definition, shouldn't account
for the ones waiting to be deleted.
The live sstable count shouldn't account for sstables waiting to
be deleted either.
Fix all that.
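The intended relationship between the three metrics can be sketched as follows. This is an illustration only: the `SSTable` record and `disk_space_metrics()` are hypothetical names, not the table's actual accounting code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    size: int               # bytes on disk
    pending_deletion: bool  # no longer live, but not yet removed from disk


def disk_space_metrics(sstables):
    """Return (live_disk_space, total_disk_space, live_sstable_count)."""
    live = [s for s in sstables if not s.pending_deletion]
    live_space = sum(s.size for s in live)
    # Total counts everything still occupying disk: live + awaiting deletion.
    total_space = sum(s.size for s in sstables)
    return live_space, total_space, len(live)


metrics = disk_space_metrics([
    SSTable(100, False),
    SSTable(50, False),
    SSTable(30, True),
])
```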
Fixes #12717.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 529a1239a9)
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.
This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
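A rough illustration of the size difference, assuming a simplified model in which every endpoint owns every range (as with everywhere_topology) and there is one range per endpoint. The function names are hypothetical and do not mirror the real get_address_ranges() signatures.

```python
def address_range_pairs(endpoints, ranges):
    """Quadratic variant: materialize every <endpoint, owned range> pair."""
    # With an "everywhere" topology every endpoint owns every range.
    return [(ep, r) for ep in endpoints for r in ranges]


def ranges_for_endpoint(endpoint, ranges):
    """Linear variant: compute the ranges of a single endpoint on demand."""
    return list(ranges)  # everywhere topology: all ranges are owned


endpoints = [f"node{i}" for i in range(100)]
ranges = [(i, i + 1) for i in range(100)]  # one token range per endpoint

pair_count = len(address_range_pairs(endpoints, ranges))  # N * N entries
single_count = len(ranges_for_endpoint("node0", ranges))  # N entries
```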
Refs #10337
Refs #10817
Refs #10836
Refs #10837
Fixes #12724
(cherry picked from commit 9e57b21e0c)
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated and requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
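The eviction rule can be summarized with a small sketch. The `QueueReason` values below are hypothetical placeholders for "why a permit was queued"; they are not the semaphore's actual enum.

```python
from enum import Enum, auto


class QueueReason(Enum):
    NONE = auto()              # the read can be admitted
    MEMORY_RESOURCES = auto()  # not enough memory
    COUNT_RESOURCES = auto()   # not enough count (slot) resources
    OTHER = auto()             # any non-resource reason (hypothetical)


RESOURCE_REASONS = {QueueReason.MEMORY_RESOURCES, QueueReason.COUNT_RESOURCES}


def should_evict_inactive_reader(has_waiters: bool, reason: QueueReason) -> bool:
    """Evict only if a waiter is blocked specifically on resources."""
    return has_waiters and reason in RESOURCE_REASONS
```

Evicting for any other reason would free resources nobody is waiting for, and the evicted reader would have to be recreated later.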
A unit test is also added, reproducing the overly aggressive eviction and
checking that it doesn't happen anymore.
Fixes: #11803
Closes #13286
(cherry picked from commit bd57471e54)
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in new stats, one for each reason a permit
can be queued.
(cherry picked from commit 7b701ac52e)
Allow changing the total or initial resources the semaphore has.
After calling `set_resources()`, the semaphore will look as if it
had been created with the specified amount of resources.
(cherry picked from commit ecc7c72acd)
This is a backport of #11949
Closes #13303
* github.com:scylladb/scylladb:
transport server: fix "request size too large" handling
transport server: fix unexpected server errors handling
test/cql-pytest.py: add scylla_inject_error() utility
test/cql-pytest: add simple tests for USE statement
Fixes #12104
Calling _read_buf.close() doesn't imply eof(), some data
may have already been read into kernel or client buffers
and will be returned next time read() is called.
When the _server._max_request_size limit was exceeded
and the _read_buf was closed, the process_request method
finished and we started processing the next request in
connection::process. The unread data from _read_buf was
treated as the header of the next request frame, resulting
in "Invalid or unsupported protocol version" error.
The existing test_shed_too_large_request was adjusted.
It was originally written with the assumption that the data
of a large query would simply be dropped from the socket
and the connection could be used to handle the
next requests. This behaviour was changed in scylladb#8800,
now the connection is closed on the Scylla side and
can no longer be used. To check there are no errors
in this case, we use Scylla metrics, getting them
from the Scylla Prometheus API.
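The failure mode described above (unread body bytes being parsed as the next frame header) can be reproduced with a toy parser. This is a sketch, not the transport server's code: it assumes the 9-byte CQL frame header layout (version, flags, stream, opcode, body length), and `MAX_REQUEST_SIZE`, `frame()` and `read_versions()` are hypothetical names.

```python
import io
import struct

HEADER = struct.Struct(">BBhBI")  # version, flags, stream, opcode, body length
MAX_REQUEST_SIZE = 16             # tiny limit, for illustration only


def frame(version, opcode, body):
    return HEADER.pack(version, 0, 0, opcode, len(body)) + body


def read_versions(stream, close_on_oversized):
    """Return the version byte seen at the start of every parsed 'frame'."""
    versions = []
    while True:
        raw = stream.read(HEADER.size)
        if len(raw) < HEADER.size:
            return versions
        version, _flags, _stream_id, _opcode, length = HEADER.unpack(raw)
        versions.append(version)
        if length > MAX_REQUEST_SIZE:
            if close_on_oversized:
                return versions  # connection closed: nothing else is parsed
            continue             # bug: the unread body is parsed as a header
        stream.read(length)      # normal case: consume the body


data = frame(0x04, 0x07, b"\xff" * 32) + frame(0x04, 0x07, b"ok")
buggy = read_versions(io.BytesIO(data), close_on_oversized=False)
fixed = read_versions(io.BytesIO(data), close_on_oversized=True)
```

In the buggy variant the second "header" starts with 0xFF, which is the kind of value that triggers an "Invalid or unsupported protocol version" error.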
(cherry picked from commit 3263523)
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was a timeout error.
A new test_batch_with_error is added. It is quite
difficult to reproduce the error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.
Closes: scylladb#12104
(cherry picked from commit a4cf509)
This patch adds scylla_inject_error(), a context manager which tests
can use to temporarily enable some error injection while some test
code is running. It can be used to write tests that artificially
inject certain errors instead of trying to reach the elaborate (and
often requiring precise timing or high amounts of data) situation where
they occur naturally.
The error-injection API is Scylla-specific (it uses the Scylla REST API)
and does not work on "release"-mode builds (all other modes are supported),
so when Cassandra or a release-mode build is being tested, the test which
uses scylla_inject_error() gets skipped.
Example usage:
```python
from rest_api import scylla_inject_error
with scylla_inject_error(cql, "injection_name", one_shot=True):
    # do something here
    ...
```
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12264
(cherry picked from commit 6d2e146aa6)
This patch adds a couple of simple tests for the USE statement: that
without USE one cannot create a table without explicitly specifying
a keyspace name, and with USE, it is possible.
Beyond testing these specific features, this patch also serves as an
example of how to write more tests that need to control the effective USE
setting. Specifically, it adds a "new_cql" function that can be used to
create a new connection with a fresh USE setting. This is necessary
in such tests, because if multiple tests use the same cql fixture
and its single connection, they will share their USE setting and there
is no way to undo or reset it after being set.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11741
(cherry picked from commit ef0da14d6f)
We currently don't clean up the system_distributed.view_build_status
table after nodes are removed. This can cause a false positive in the
check for whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, and even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
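A minimal sketch of the filtering idea, assuming the table contents are available as a mapping from host ID to build status. The helper name, the sample host IDs and the status strings are hypothetical.

```python
def filter_known_hosts(view_build_status, known_hosts):
    """Drop status rows that belong to hosts no longer in the cluster."""
    return {host: status
            for host, status in view_build_status.items()
            if host in known_hosts}


# Hypothetical table contents: host id -> build status of one view.
rows = {"host-a": "SUCCESS", "host-b": "SUCCESS", "host-removed": "STARTED"}
filtered = filter_known_hosts(rows, known_hosts={"host-a", "host-b"})
view_built_everywhere = all(s == "SUCCESS" for s in filtered.values())
```

Without the filter, the stale row of the removed host would make the view look unfinished forever.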
Fixes: #11905
Refs: #11836
Closes #11860
(cherry picked from commit 84a69b6adb)
In some Docker instance configurations, hostname resolution does not
work, so our script fails on startup because we use hostname -i to
construct cqlshrc.
To prevent the error, we can use --rpc-address or --listen-address
for the address, since it should be the same.
Fixes #12011
Closes #12115
(cherry picked from commit 642d035067)
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.
Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
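The race can be modelled with asyncio. This is a toy model, not the Paxos or CDC code: the dictionary stands in for the base table, and `augment()` and `mutate()` only mimic the ordering of `augment_mutation_call` and `mutate_internal`.

```python
import asyncio


async def learn_decision(table, key, new_value, sequential):
    """Toy model: returns the preimage captured for an update of table[key]."""
    result = {}

    async def augment():           # stands in for augment_mutation_call
        await asyncio.sleep(0.01)  # reading the base table takes a moment
        result["preimage"] = table.get(key)

    async def mutate():            # stands in for mutate_internal
        table[key] = new_value

    if sequential:
        await augment()            # read the preimage first...
        await mutate()             # ...then apply the base write
    else:
        await asyncio.gather(augment(), mutate())  # racing operations
    return result["preimage"]


racy = asyncio.run(learn_decision({"k": "old"}, "k", "new", sequential=False))
fixed = asyncio.run(learn_decision({"k": "old"}, "k", "new", sequential=True))
```

In the concurrent variant the write lands while the read is still pending, so the "preimage" already contains the new value.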
`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.
Fixes #12098
(cherry picked from commit 1ef113691a)
This PR backports 2f4a793457 to branch-5.1. Said patch depends on some other patches that are not part of any release yet.
Closes #13224
* github.com:scylladb/scylladb:
reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
reader_permit: expose operator<<(reader_permit::state)
reader_permit: add get_state() accessor
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in several ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)
The list goes on. We already have an evict() method that does all this
correctly; use that instead of the current badly open-coded alternative.
This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.
Fixes: #13048
Closes #13049
(cherry picked from commit 2f4a793457)
This is another attempt to fix #13001 on `branch-5.1`.
In #13001 we found a test case which causes a crash on `branch-5.1` because it didn't handle `UNSET_VALUE` properly:
```python3
def test_unset_insert_where(cql, table2):
    p = unique_key_int()
    stmt = cql.prepare(f'INSERT INTO {table2} (p, c) VALUES ({p}, ?)')
    with pytest.raises(InvalidRequest, match="unset"):
        cql.execute(stmt, [UNSET_VALUE])

def test_unset_insert_where_lwt(cql, table2):
    p = unique_key_int()
    stmt = cql.prepare(f'INSERT INTO {table2} (p, c) VALUES ({p}, ?) IF NOT EXISTS')
    with pytest.raises(InvalidRequest, match="unset"):
        cql.execute(stmt, [UNSET_VALUE])
```
This problem has been fixed on `master` by PR #12517. I tried to backport it to `branch-5.1` (#13029), but this didn't go well - it was a big change that touched a lot of components. It's hard to make sure that it won't cause some unexpected issues.
Then I made a simpler fix for `branch-5.1`, which achieves the same effect as the original PR (#13057).
The problem is that this effect includes backwards incompatible changes - it bans UNSET_VALUE in some places that `branch-5.1` used to allow.
Breaking changes are bad, so I made this PR, which does an absolutely minimal change to fix the crash.
It adds a check the moment before the crash would happen.
To make sure that everything works correctly, and to detect any possible breaking changes, I wrote a bunch of tests that validate the current behavior.
I also ported some tests from the `master` branch, at least the ones that were in line with the behavior on `branch-5.1`.
Closes #13133
* github.com:scylladb/scylladb:
cql-pytest/test_unset: port some tests from master branch
cql-pytest/test_unset: test unset value in UPDATEs with LWT conditions
cql-pytest/test_unset: test unset value in UPDATEs with IF EXISTS
cql-pytest/test_unset: test unset value in UPDATE statements
cql-pytest/test_unset: test unset value in INSERTs with IF NOT EXISTS
cql-pytest/test_unset: test unset value in INSERT statements
cas_request: fix crash on unset value in primary key with LWT
I copied cql-pytest tests from the master branch,
at least the ones that were compatible with branch-5.1
Some of them were expecting an InvalidRequest exception
in case of UNSET VALUES being present in places that
branch-5.1 allows, so I skipped these tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add tests which test INSERT statements with IF NOT EXISTS,
when an UNSET_VALUE is passed for some column.
The tests are similar to the previous ones done for simple
INSERTs without IF NOT EXISTS.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add some tests which test what happens when an UNSET_VALUE
is passed to an INSERT statement.
Passing it for partition key column is impossible
because python driver doesn't allow it.
Passing it for clustering key column causes Scylla
to silently ignore the INSERT.
Passing it for a regular or static column
causes this column to remain unchanged,
as expected.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Doing an LWT INSERT/UPDATE and passing UNSET_VALUE
for the primary key column used to cause a crash.
This is a minimal fix for this crash.
Crash backtrace pointed to a place where
we tried doing .front() on an empty vector
of primary key ranges.
I added a check that the vector isn't empty.
If it's empty then let's throw an error
and mention that it's most likely
caused by an unset value.
This has been fixed on master,
but the PR that fixed it introduced
breaking changes, which I don't want
to add to branch-5.1.
This fix is absolutely minimal
- it performs the check at the
last moment before a crash.
It's not the prettiest, but it works
and can't introduce breaking changes,
because the new code gets activated
only in cases that would've caused
a crash.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.
`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.
Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.
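The class of bug can be illustrated with a small expression rewriter. This is a sketch under stated assumptions: `BinaryOperator`, `Order` and this `search_and_replace()` are simplified Python stand-ins, not the `expr` C++ types.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Order(Enum):
    CQL = "cql"                # ordinary CQL comparison
    CLUSTERING = "clustering"  # raw clustering bound (SCYLLA_CLUSTERING_BOUND)


@dataclass(frozen=True)
class BinaryOperator:
    lhs: object
    op: str
    rhs: object
    order: Order = Order.CQL


def search_and_replace(expr, fn):
    """Rebuild `expr`, substituting any node for which fn() returns a value."""
    replaced = fn(expr)
    if replaced is not None:
        return replaced
    if isinstance(expr, BinaryOperator):
        # dataclasses.replace() copies every field that is not overridden,
        # so `order` survives. Rebuilding with BinaryOperator(lhs, op, rhs)
        # would silently reset it to the default.
        return replace(expr,
                       lhs=search_and_replace(expr.lhs, fn),
                       rhs=search_and_replace(expr.rhs, fn))
    return expr


bound = BinaryOperator("ck", ">", "?", order=Order.CLUSTERING)
rewritten = search_and_replace(bound, lambda e: 42 if e == "?" else None)
```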
Fixes: #13055
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13056
(cherry picked from commit aa604bd935)
EOF is only guaranteed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.
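The same idea, shown with a Python byte stream rather than the C++ stream used by `validate_checksums()`. The helper `at_eof()` is a hypothetical name for illustration.

```python
import io


def at_eof(stream) -> bool:
    """Probe for EOF by attempting to read one more byte."""
    # A stream only reports end-of-file once a read goes past the end,
    # so try to read and check that nothing came back.
    return stream.read(1) == b""


content = b"checksummed payload"
data = io.BytesIO(content)
payload = data.read(len(content))  # consumes exactly all bytes
reached_eof = at_eof(data)
```

Note that the probe consumes a byte when the stream is not at EOF, so it is only suitable as a final check.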
Fixes: #11190
Closes #12696
(cherry picked from commit 693c22595a)
Currently, UDAs can't be reused if Scylla has been
restarted since they have been created. This is
caused by the missing initialization of saved
UDAs that should have inserted them to the
cql3::functions::functions::_declared map, that
should store all (user-)created functions and
aggregates.
This patch adds the missing implementation in a way
that's analogous to the method of inserting UDF to
the _declared map.
Fixes #11309
(cherry picked from commit e558c7d988)
The reason is alloc-dealloc mismatch of position_in_partition objects
allocated by cursors inside coroutine object stored in the update
variable in row_cache::do_update()
It is allocated under cache region, but in case of exception it will
be destroyed under the standard allocator. If update is successful, it
will be cleared under region allocator, so there is no problem in the
normal case.
Fixes #12068
Closes #12233
(cherry picked from commit 992a73a861)
This commit makes the following changes to the docs landing page:
- Adds the ScyllaDB enterprise docs as one of three tiles.
- Modifies the three tiles to reflect the three flavors of ScyllaDB.
- Moves the "New to ScyllaDB? Start here!" under the page title.
- Renames "Our Products" to "Other Products" to list the products other
than ScyllaDB itself. In addition, the boxes are enlarged to
large-4 to look better.
The major purpose of this commit is to expose the ScyllaDB
documentation.
docs: fix the link
(cherry picked from commit 27bb8c2302)
Closes #13086
Azure metadata API may return empty zone sometimes. If that happens
shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to
other shards.
Empty zones response should be handled regardless.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12274
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several snitch drivers make HTTP requests to get
region/dc/zone/rack/whatever from the cloud provider. They blindly rely
on the response being successful and read the response body to parse the
data they need.
That's not nice; add checks that requests finish with HTTP OK status.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12287
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.
Fixes: #12821
Fixes: #12823
Fixes: #12708
Closes #12824
(cherry picked from commit ef548e654d)
we should never return a reference to local variable.
so in this change, a reference to a static variable is returned
instead. this should address following warning from Clang 17:
```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
return {};
^~
```
Fixes #12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12876
(cherry picked from commit 6eab8720c4)
Currently they are upgraded during learn on a replica. There are two
problems with this. First, the column mapping may not exist on a replica
if it missed this particular schema (because it was down, for instance)
and the mapping history is not part of the schema. In this case "Failed
to look up column mapping for schema version" will be thrown. Second, the
LWT request coordinator may not have the schema for the mutation either
(because it was already freed from the registry), and when a replica
tries to retrieve the schema from the coordinator the retrieval will fail,
causing the whole request to fail with "Schema version XXXX not found".
Both of those problems can be fixed by upgrading stored mutations
during prepare on the node they are stored at. To upgrade a mutation, its
column mapping is needed, and it is guaranteed to be present
at the node the mutation is stored at, since storing it requires
that the corresponding schema is available. After that the mutation
is processed using the latest schema, which will be available on all nodes.
Fixes #10770
Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>
(cherry picked from commit 15ebd59071)
trim_clustering_row_ranges_to() is broken for non-full keys in reverse
mode. It will trim the range to
position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.
Fixes #12180
Refs #1446
(cherry picked from commit 536c0ab194)
A frozen set can be part of the clustering key, and with compact
storage, the corresponding key component can have an empty value.
Comparison was not prepared for this, the iterator attempts to
deserialize the item count and will fail if the value is empty.
Fixes #12242
(cherry picked from commit 232ce699ab)
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and for YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
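A minimal sketch of the intended matching, assuming option specs are plain strings in the "long_name,short_name" form. `split_option_name()` and `apply_yaml()` are hypothetical helpers, not functions from db/config.cc, and the option list is only illustrative.

```python
def split_option_name(name: str):
    """Split a 'long_name,short_name' spec; the short name is optional."""
    long_name, _, short_name = name.partition(",")
    return long_name, short_name or None


def apply_yaml(option_specs, yaml_doc):
    """Match YAML keys against the long name only, not the verbatim spec."""
    long_names = {split_option_name(spec)[0] for spec in option_specs}
    return {k: v for k, v in yaml_doc.items() if k in long_names}


settings = apply_yaml(["workdir,W", "developer_mode"],
                      {"workdir": "/var/lib/scylla"})
```

With verbatim matching, the YAML key would have had to be spelled "workdir,W" to be recognized.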
Fixes #7478
Fixes #9500
Fixes #11503
Closes #11506
(cherry picked from commit af7ace3926)
We currently configure only TimeoutStartSec, but probably it's not
enough to prevent coredump timeout, since TimeoutStartSec is maximum
waiting time for service startup, and there is another directive to
specify maximum service running time (RuntimeMaxSec).
To fix the problem, we should specify RuntimeMaxSec and TimeoutSec (it
configures both TimeoutStartSec and TimeoutStopSec).
Fixes #5430
Closes #12757
(cherry picked from commit bf27fdeaa2)
Related https://github.com/scylladb/scylladb/issues/12658.
This issue fixes the bug in the upgrade guides for the released versions.
Closes #12679
* github.com:scylladb/scylladb:
doc: fix the service name in the upgrade guide for patch releases versions 2022
doc: fix the service name in the upgrade guide from 2021.1 to 2022.1
(cherry picked from commit 325246ab2a)
Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall, on a special event like schema change.
A simple conflict was resolved in the first patch, since master has compaction groups. It was very easy to resolve.
Regression since 1d9f53c881, which is present in 5.1 onwards. So probably it merits a backport to 5.2 too.
Closes #12769
* github.com:scylladb/scylladb:
compaction: Fix inefficiency when updating LCS backlog tracker
table: Fix quadratic behavior when inserting sstables into tracker on schema change
LCS backlog tracker uses STCS tracker for L0. Turns out LCS tracker
is calling STCS tracker's replace_sstables() with empty arguments
even when higher levels (> 0) *only* had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.
As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.
Inefficiency is fixed by only updating the STCS tracker if any
L0 sstable is being added or removed from the table.
This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantially higher number of sstables,
therefore updating the STCS tracker only when level 0 changes
significantly reduces the number of times the L0 backlog is recomputed.
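The guard can be sketched with two toy trackers. The classes and the `(name, level)` sstable tuples are hypothetical simplifications, not the backlog tracker implementation.

```python
class ToySTCSTracker:
    def __init__(self):
        self.recomputations = 0

    def replace_sstables(self, old, new):
        self.recomputations += 1  # recomputing the L0 backlog is the costly part


class ToyLCSTracker:
    def __init__(self):
        self.l0_tracker = ToySTCSTracker()

    def replace_sstables(self, old, new):
        # Each sstable is modelled as a (name, level) tuple.
        l0_old = [s for s in old if s[1] == 0]
        l0_new = [s for s in new if s[1] == 0]
        # Only forward to the STCS tracker if level 0 actually changed.
        if l0_old or l0_new:
            self.l0_tracker.replace_sstables(l0_old, l0_new)


tracker = ToyLCSTracker()
for i in range(1000):                         # higher-level replacements only
    tracker.replace_sstables([], [(f"sst{i}", 2)])
tracker.replace_sstables([], [("fresh", 0)])  # a single L0 change
```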
Refs #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12676
(cherry picked from commit 1b2140e416)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Each time backlog tracker is informed about a new or old sstable, it
will recompute the static part of backlog, whose complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.
Fixes #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12593
(cherry picked from commit 87ee547120)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Convert decompressed temporary buffers into tracked buffers just before
returning them to the upper layer. This ensures these buffers are known
to the reader concurrency semaphore and it has an accurate view of the
actual memory consumption of reads.
Fixes: #12448
Closes #12454
(cherry picked from commit c4688563e3)
Consider the following MVCC state of a partition:
v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
Where === means a continuous range and --- means a discontinuous range.
After two LRU items are evicted (entry1 and entry2), we will end up with:
v2: ---------------------- <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.
This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:
v2: ---------------------- <9> ===== <last dummy>
...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.
The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.
The situation is not easy to trigger because it requires certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, so across memtable flushes.
Closes #12452
(cherry-picked from commit f97268d8f2)
Fixes #12451.
LCS reshape is compacting all levels if a single one breaks
disjointness. That's unnecessary work because rewriting that single
level is enough to restore disjointness. If multiple levels break
disjointness, they'll each be reshaped in its own iteration, so
reducing operation time for each step and disk space requirement,
as input files can be released incrementally.
Incremental compaction is not applied to reshape yet, so we need to
avoid "major compaction", to avoid the space overhead.
But space overhead is not the only problem, the inefficiency, when
deciding what to reshape when overlapping is detected, motivated
this patch.
Fixes #12495.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12496
(cherry picked from commit f2f839b9cc)
Currently, we create `forward_aggregates` inside a function that
returns the result of a future lambda that captures these aggregates
by reference. As a result, the aggregates may be destructed before
the lambda finishes, resulting in a heap use-after-free.
To prolong the lifetime of these aggregates, we cannot use a move
capture, because the lambda is wrapped in a with_thread_if_needed()
call on these aggregates. Instead, we fix this by wrapping the
entire return statement in a do_with().
Fixes #12528
Closes #12533
(cherry picked from commit 5f45b32bfa)
Currently reverse types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema, which has a reversed<frozen<UDT>> clustering key component,
will have this component incorrectly represented in the schema cql dump:
the UDT will lose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail, as only frozen UDTs are
allowed in primary key components.
Fixes: #12576
Closes #12579
(cherry picked from commit ebc100f74f)
Fixes #12601 (maybe?)
Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
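A sketch of paging over a sorted set of IDs, with a cursor that resumes strictly after the last returned ID. `list_page()` and the single-letter IDs are hypothetical; the real listing works on table UUIDs/ARNs.

```python
def list_page(table_ids, limit, exclusive_start=None):
    """Return (page, cursor) over table IDs in sorted order."""
    ordered = sorted(table_ids)
    if exclusive_start is not None:
        # Resume strictly after the last ID of the previous page.
        ordered = [t for t in ordered if t > exclusive_start]
    page = ordered[:limit]
    cursor = page[-1] if len(ordered) > limit else None
    return page, cursor


ids = {"c", "a", "d", "b"}
first_page, cursor = list_page(ids, limit=2)
ids.add("e")  # table created between the two paged calls
second_page, _ = list_page(ids, limit=2, exclusive_start=cursor)
```

Because the order is stable, a table can be missed (if its ID sorts before the cursor) but never returned twice.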
(cherry picked from commit da8adb4d26)
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
before this change, we construct a sstring from a comma statement,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.
in this change
* instead of passing the size of the string_view,
both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
performance, as we don't need to perform a deep copy
the issue is reported by GCC-13:
```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
^~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12666
(cherry picked from commit 186ceea009)
Fixes #12739.
(cherry picked from commit b588b19620)
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space is not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
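The ordering problem can be shown in a few lines. This is a toy model: `SegmentFile` and the two `delete_segment_*` functions are hypothetical, and capturing the size before removal is one plausible way to fix the accounting, assumed here for illustration.

```python
class SegmentFile:
    def __init__(self, size):
        self.known_size = size

    def remove_file(self):
        self.known_size = 0  # removal resets the recorded size


def delete_segment_buggy(total_size_on_disk, f):
    f.remove_file()
    return total_size_on_disk - f.known_size  # subtracts 0: space is leaked


def delete_segment_fixed(total_size_on_disk, f):
    size = f.known_size                       # capture the size first
    f.remove_file()
    return total_size_on_disk - size


buggy_total = delete_segment_buggy(64, SegmentFile(32))
fixed_total = delete_segment_fixed(64, SegmentFile(32))
```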
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes #12645
Closes #12646
(cherry picked from commit fa7e904cd6)
Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.
(cherry picked from commit 1cc68b262e)
Cherry-pick note: the original commit added a lot of new stuff like
describing the Raft upgrade procedure, but also fixed problems with the
existing documentation. In this backport we include only the latter.
Closes #12582
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.
To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
Representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
- using steady_clock is just broken, so we aren't taking anything
from users by breaking it further
- once all nodes are upgraded, it magically starts to work
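The clock mismatch described above can be illustrated numerically. This is a simplified model, assuming a steady clock counts seconds from a node-local origin (here, boot time) and that wall clocks are reasonably synchronized across nodes; all names and numbers are made up for the example.

```python
def make_steady_clock(boot_time):
    """Steady clock: counts from a node-local, arbitrary origin (its boot)."""
    return lambda wall_now: wall_now - boot_time


wall_now = 1_000_000                                  # shared wall-clock time
node_a_steady = make_steady_clock(boot_time=999_000)  # node A: up for 1000 s
node_b_steady = make_steady_clock(boot_time=400_000)  # node B: up for 600000 s

timeout = 10
# Deadline computed on node A with its steady clock, then sent to node B.
steady_deadline = node_a_steady(wall_now) + timeout
expired_on_b_steady = node_b_steady(wall_now) >= steady_deadline

# A deadline based on the system (wall) clock is comparable across nodes.
system_deadline = wall_now + timeout
expired_on_b_system = wall_now >= system_deadline
```

Here node B sees the steady-clock deadline as long expired (a premature timeout); with the roles reversed the timeout would be delayed instead.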
Closes #12529
(cherry picked from commit bbbe12af43)
Fixes #12458
This is a backport of 9fa1783892 (#11902) to branch-5.1
Flush the memtable before cleaning up the table so as not to leave any disowned tokens in the memtable,
as they might be resurrected if left in the memtable.
Refs #1239
Closes #12490
* github.com:scylladb/scylladb:
table: perform_cleanup_compaction: flush memtable
table: add perform_cleanup_compaction
api: storage_service: add logging for compaction operations et al
We don't explicitly cleanup the memtable, while
it might hold tokens disowned by the current node.
Flush the memtable before performing cleanup compaction
to make sure all tokens in the memtable are cleaned up.
Note that non-owned ranges are invalidated in the cache
in compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.
\Fixes #1239
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit eb3a94e2bc)
Move the integration with compaction_manager
from the api layer to the table class so
it can also make sure the memtable is cleaned up in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit fc278be6c4)
The line modified in this patch was supposed to increase the
optimization levels of parsers in debug mode to 1, because they
were too slow otherwise. But as a side effect, it also reduced the
optimization level in release mode to 1. This is not a problem
for the CQL frontend, because statement preparation is not
performance-sensitive, but it is a serious performance problem
for Alternator, where it lies in the hot path.
Fix this by only applying the -O1 to debug modes.
Fixes #12463
Closes #12460
(cherry picked from commit 08b3a9c786)
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batches of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafted large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: To detect when build_some() is done it is no
longer enough to check if it returned zero view-update rows; Rather,
it explicitly returns whether or not it is done as an std::optional.
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
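The difference between "this batch produced nothing" and "there is nothing left" can be sketched in Python. This is a simplified model, not the actual view-update code: `build_all` and the `needs_update` predicate are hypothetical, and `None` stands in for the empty std::optional.

```python
MAX_ROWS_FOR_VIEW_UPDATES = 100  # batch size named in the text


def build_some(rows, start, needs_update):
    """Process one batch of base rows.

    Returns None when no base rows remain (done), otherwise the possibly
    empty list of view updates generated for this batch.
    """
    if start >= len(rows):
        return None
    batch = rows[start:start + MAX_ROWS_FOR_VIEW_UPDATES]
    return [row for row in batch if needs_update(row)]


def build_all(rows, needs_update):
    updates = []
    start = 0
    while True:
        result = build_some(rows, start, needs_update)
        if result is None:      # explicit "done" signal
            break
        updates.extend(result)  # an empty batch does not end the loop
        start += MAX_ROWS_FOR_VIEW_UPDATES
    return updates
```

With 250 rows where the first 100 need no view update, a loop that stopped on the first empty batch would return nothing, while `build_all` still returns the updates for the remaining 150 rows.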
Fixes #12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12305
(cherry picked from commit 92d03be37b)
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism, however, had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added without the fiber acting on them. Since the fiber still held the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released at the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds detection of the inactive/waiter contradiction.
Fixes: #11923
Refs: #11770
Closes #12026
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
(cherry picked from commit 15ee8cfc05)
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods which admits permits on user
request (the front door) and the maybe_admit_waiters() which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged in time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.
Fixes: #11770
Closes #11784
(cherry picked from commit 7fbad8de87)
The --online-discard option is defined as a string parameter, since it doesn't
specify "action=", but its default value is a boolean (default=True).
This breaks "provisioning in a similar environment", since the code
assumes a boolean option declared with "action='store_true'", which it is not.
We should change the type of the option to int, and also specify
"choices=[0, 1]" just like --io-setup does.
Fixes #11700
Closes #11831
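For the --online-discard change above, a minimal argparse sketch shows the proposed declaration. The surrounding parser is hypothetical, and `default=1` is an assumption standing in for the previous `default=True`.

```python
import argparse


def make_parser():
    parser = argparse.ArgumentParser()
    # Without type=, a value given on the command line stays a string,
    # so "--online-discard 0" would yield "0", which is truthy.
    parser.add_argument("--online-discard", type=int, choices=[0, 1], default=1)
    return parser


args = make_parser().parse_args(["--online-discard", "0"])
print(bool(args.online_discard))  # prints False
```

With `type=int` the value round-trips as 0 or 1, so a command line regenerated from the parsed arguments behaves the same as the original one.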
(cherry picked from commit acc408c976)
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.
Fixes: #12060
(cherry picked from commit 7730c4718e)
When the mutation compactor has all the rows it needs for a page, it
saves the decision to stop in a member flag: _stop.
For single partition queries, the mutation compactor is kept alive
across pages and so it has a method, start_new_page() to reset its state
for the next page. This method didn't clear the _stop flag. This meant
that the value set at the end of the previous page could cause the new page
and subsequently the entire query to be stopped prematurely.
This can happen if the new page starts with a row that is covered by a
higher level tombstone and is completely empty after compaction.
Reset the _stop flag in start_new_page() to prevent this.
This commit also adds a unit test which reproduces the bug.
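The shape of the fix can be sketched with a toy pager. `PageCompactor` is a hypothetical stand-in that only models the flag handling, not tombstones or the real compactor interface.

```python
class PageCompactor:
    """Toy stand-in for a compactor that is kept alive across pages."""

    def __init__(self, page_size):
        self.page_size = page_size
        self._rows_in_page = 0
        self._stop = False

    def start_new_page(self):
        self._rows_in_page = 0
        self._stop = False  # the fix: forget the previous page's decision

    def consume(self, row):
        """Return True if the row was accepted into the current page."""
        if self._stop:
            return False
        self._rows_in_page += 1
        if self._rows_in_page == self.page_size:
            self._stop = True  # the page is full, remember to stop
        return True
```

Without the reset in `start_new_page()`, the first `consume()` call of the next page would see the stale flag and refuse the row.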
Fixes: #12361
Closes #12384
(cherry picked from commit b0d95948e1)
This series backports several patches which add or enable tests for Alternator TTL. The series does not touch the code - just tests.
The goal of backporting more tests is to get the code - which is already in branch 5.1 - tested. It wasn't a good idea to backport code without backporting the tests for it.
Closes #12200
Fixes #11374
* github.com:scylladb/scylladb:
test/alternator: increase timeout on TTL tests
test/alternator: fix timeout in flaky test test_ttl_stats
test/alternator: test Alternator TTL metrics
test/alternator: skip fewer Alternator TTL tests
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound existed. This is a source of reactor stalls.
Fix that.
Fixes #12271
Closes #12364
(cherry picked from commit d9269abf5b)
Fix https://github.com/scylladb/scylla-doc-issues/issues/816
Fix https://github.com/scylladb/scylla-docs/issues/1613
This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by the ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487).
To make sure the same CQL version is used across the docs, we should use the `|cql-version|` variable rather than hardcode the version number on several pages.
The variable is specified in the conf.py file:
```
rst_prolog = """
.. |cql-version| replace:: 3.3.1
"""
```
Closes#11320
* github.com:scylladb/scylladb:
doc: add the Cassandra version on which the tools are based
doc: fix the version number
doc: update the Enterprise version where the ME format was introduced
doc: add the ME format to the Cassandar Compatibility page
doc: replace Scylla with ScyllaDB
doc: rewrite the Interfaces table to the new format to include more information about CQL support
doc: remove the CQL version from pages other than Cassandra compatibility
doc: fix the CQL version in the Interfaces table
(cherry picked from commit ee606a5d52)
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:
1. lock_pk acquires an exclusive lock on the partition.
2.a lock_ck attempts to acquire shared lock on the partition
and any lock on the row. both cases currently use a fiber
returning a future<rwlock::holder>.
2.b since the partition is locked, the lock_partition times out
returning an exceptional future. lock_row has no such problem
and succeeds, returning a future holding a rwlock::holder,
pointing to the row lock.
3.a the lock_holder previously returned by lock_pk is destroyed,
calling `row_locker::unlock`
3.b row_locker::unlock sees that the partition is not locked
and erases it, including the row locks it contains.
4.a when_all_succeeds continuation in lock_ck runs. Since
the lock_partition future failed, it destroys both futures.
4.b the lock_row future is destroyed with the rwlock::holder value.
4.c ~holder attempts to return the semaphore units to the row rwlock,
but the latter was already destroyed in 3.b above.
Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above.
This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.
This way, erasing the unlocked partition is never expected
to happen while any of its row locks is held.
Fixes #12168
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12208
(cherry picked from commit 5007ded2c1)
This PR adds the link to the KB article about updating the mode after the upgrade to the 5.1 upgrade guide.
In addition, I have:
- updated the KB article to include the versions affected by that change.
- fixed the broken link to the page about metric updates (it is not related to the KB article, but I fixed it in the same PR to limit the number of PRs that need to be backported).
Related: https://github.com/scylladb/scylladb/pull/11122
Closes #12148
* github.com:scylladb/scylladb:
doc: update the releases in the KB about updating the mode after upgrade
doc: fix the broken link in the 5.1 upgrade guide
doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide
(cherry picked from commit 897b501ba3)
This is a backport of https://github.com/scylladb/scylladb/pull/11783.
Closes#12229
* github.com:scylladb/scylladb:
doc: replace Scylla with ScyllaDB
doc: add a comment to remove in future versions any information that refers to previous releases
doc: rewrite the notes to improve clarity
doc: remove the reperitions from the notes
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally when an abort is requested.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
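The idea of tying a wait to an abort signal can be sketched with asyncio. This is an analogy only, not the Raft server code: `wait_for_commit`, the future, and the event are hypothetical stand-ins for the promise and the `abort_source`.

```python
import asyncio


async def wait_for_commit(committed, abort):
    """Wait for the `committed` future, but fail promptly if `abort` is set."""
    abort_task = asyncio.ensure_future(abort.wait())
    done, _ = await asyncio.wait(
        {committed, abort_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if committed in done:
        abort_task.cancel()
        return committed.result()
    committed.cancel()  # nobody waits on it anymore
    raise RuntimeError("request aborted")


async def demo():
    loop = asyncio.get_running_loop()
    committed = loop.create_future()  # never resolved in this demo
    abort = asyncio.Event()
    loop.call_soon(abort.set)
    try:
        await wait_for_commit(committed, abort)
    except RuntimeError:
        return "aborted"
    return "committed"


print(asyncio.run(demo()))  # prints aborted
```

Without the abort branch, `demo()` would hang forever, which mirrors the stuck `set_configuration` call described above.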
Fixes #11288.
Closes #11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
(cherry picked from commit 83850e247a)
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by dropping the waiters of any remaining
entries (if the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes #11235.
Closes #11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
(cherry picked from commit 9c4e32d2e2)
Contains fixes requested in the issue (and some tiny extras), together with analysis why they don't affect the users (see commit messages).
Fixes [#11800](https://github.com/scylladb/scylladb/issues/11800)
Closes #11926
* github.com:scylladb/scylladb:
alternator: add maybe_quote to secondary indexes 'where' condition
test/alternator: correct xfail reason for test_gsi_backfill_empty_string
test/alternator: correct indentation in test_lsi_describe
alternator: fix wrong 'where' condition for GSI range key
(cherry picked from commit ce7c1a6c52)
The SELECT JSON statement, just like SELECT, allows the user to rename
selected columns using an "AS" specification. E.g., "SELECT JSON v AS foo".
This specification was not honored: We simply forgot to look at the
alias in SELECT JSON's implementation (we did it correctly in regular
SELECT). So this patch fixes this bug.
We had two tests in cassandra_tests/validation/entities/json_test.py
that reproduced this bug. The checks in those tests now pass, but these
two tests still continue to fail after this patch because of two other
unrelated bugs that were discovered by the same tests. So in this patch
I also add a new test just for this specific issue - to serve as a
regression test.
Fixes#8078
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12123
(cherry picked from commit c5121cf273)
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch, and passes afterwards. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
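The caching pattern and its invalidation can be sketched as follows. `ViewInfo` here is a toy model with hypothetical fields, not the real view_info class.

```python
class ViewInfo:
    """Toy model of a view that lazily caches a statement built from its base."""

    def __init__(self, base_columns):
        self._base_columns = list(base_columns)
        self._select = None  # lazily computed and cached

    def select_statement(self):
        if self._select is None:
            self._select = "SELECT " + ", ".join(self._base_columns)
        return self._select

    def set_base_info(self, base_columns):
        self._base_columns = list(base_columns)
        self._select = None  # drop the cached statement built from the old schema


view = ViewInfo(["p", "v"])
print(view.select_statement())  # prints SELECT p, v
view.set_base_info(["p", "v", "w"])
print(view.select_statement())  # prints SELECT p, v, w
```

If `set_base_info()` left `_select` untouched, the second call would keep returning the statement built for the old schema.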
Fixes #10026
Fixes #11542
(cherry picked from commit 2f2f01b045)
Some of the tests in test/alternator/test_ttl.py need an expiration scan
pass to complete and expire items. In development builds on developer
machines, this usually takes less than a second (our scanning period is
set to half a second). However, in debug builds on Jenkins each scan
often takes up to 100 (!) seconds (this is the record we've seen so far).
This is why we set the tests' timeout to 120.
But recently we saw another test run failing. I think the problem is
that in some cases, we need not one, but *two* scanning passes to
complete before the timeout: It is possible that the test writes an
item right after the current scan passed it, so it doesn't get expired,
and then the second scan starts at a random position, possibly making the
item in question one of the last items to be considered - so in total
we need to wait for two scanning periods, not one, for the item to
expire.
So this patch increases the timeout from 120 seconds to 240 seconds -
more than twice the highest scanning time we ever saw (100 seconds).
Note that this timeout is just a timeout, it's not the typical test
run time: The test can finish much more quickly, as little as one
second, if items expire quickly on a fast build and machine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12106
(cherry picked from commit 6bc3075bbd)
The test `test_metrics.py::test_ttl_stats` tests the metrics associated
with Alternator TTL expiration events. It normally finishes in less than a
second (the TTL scanning is configured to run every 0.5 seconds), so we
arbitrarily set a 60 second timeout for this test to allow for extremely
slow test machines. But in some extreme cases even this was not enough -
in one case we measured the TTL scan to take 63 seconds.
So in this patch we increase the timeout in this test from 60 seconds
to 120 seconds. We already did the same change in other Alternator TTL
tests in the past - in commit 746c4bd.
Fixes#11695
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11696
(cherry picked from commit 3a30fbd56c)
This patch adds a test for the metrics generated by the background
expiration thread run for Alternator's TTL feature.
We test three of the four metrics: scylla_expiration_scan_passes,
scylla_expiration_scan_table and scylla_expiration_items_deleted.
The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the
number of times that this node took over another node's expiration duty,
so it requires a multi-node cluster to test, and we can't test it in the
single-node cluster test framework.
To see TTL expiration in action this test may need to wait up to the
setting of alternator_ttl_period_in_seconds. For a setting of 1
second (the default set by test/alternator/run), this means this
test can take up to 1 second to run. If alternator_ttl_period_in_seconds
is set higher, the test is skipped unless --runveryslow is requested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 297109f6ee)
Most of the Alternator TTL tests are extremely slow on DynamoDB because
item expiration may be delayed up to 24 hours (!), and in practice for
10 to 30 minutes. Because of this, we marked most of these tests
with the "veryslow" mark, causing them to be skipped by default - unless
pytest is given the "--runveryslow" option.
The result was that the TTL tests were not run in the normal test runs,
which can allow regressions to be introduced (luckily, this hasn't happened).
However, this "veryslow" mark was excessive. Many of the tests are very
slow only on DynamoDB, but aren't very slow on Scylla. In particular,
many of the tests involve waiting for an item to expire, something that
happens after the configurable alternator_ttl_period_in_seconds, which
is just one second in our tests.
So in this patch, we remove the "veryslow" mark from 6 tests of Alternator TTL
tests, and instead use two new fixtures - waits_for_expiration and
veryslow_on_aws - to only skip the test when running on DynamoDB or
when alternator_ttl_period_in_seconds is high - but in our usual test
environment they will not get skipped.
Because 5 of these 6 tests wait for an item to expire, they take one
second each and this patch adds 5 seconds to the Alternator test
runtime. This is unfortunate (it's more than 25% of the total Alternator
test runtime!) but not a disaster, and we plan to reduce this 5 second
time further in the following patch, by decreasing the TTL scanning
period even further.
This patch also increases the timeout of several of these tests, to 120
seconds from the previous 10 seconds. As mentioned above, normally,
these tests should always finish in alternator_ttl_period_in_seconds
(1 second) with a single scan taking less than 0.2 seconds, but in
extreme cases of debug builds on overloaded test machines, we saw even
60 seconds being passed, so let's increase the maximum. I also needed
to make the sleep time between retries smaller, not a function of the
new (unrealistic) timeout.
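A polling helper with a generous timeout but a short, fixed sleep between retries might look like this. It is a sketch under assumptions: `wait_until` and its parameters are hypothetical names, not the helpers used in test_ttl.py.

```python
import time


def wait_until(predicate, timeout, sleep=0.1):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    The sleep between retries is a fixed small value and is deliberately
    not derived from `timeout`, so a large timeout does not slow down the
    common case where the condition becomes true quickly.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(sleep)
```

A test would call something like `wait_until(item_expired, timeout=120)`, where `item_expired` is whatever check the test needs. The timeout only bounds the worst case.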
4 more tests remain "veryslow" (and won't run by default) because they
take 5-10 seconds each (e.g., a test which waits to see that an item
does *not* get expired, and a test involving writing a lot of data).
We should reconsider this in the future - to perhaps run these tests in
our normal test runs - but even for now, the 6 extra tests that we
start running are a much better protection against regressions than what
we had until now.
Fixes #11374
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 746c4bd9eb)
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.
It's expected behaviour of
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates rhs
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was modification_statement::execute_with_condition
was not expecting an empty clustering_range in case of
a null clustering key.
Fixes: #11954
(cherry picked from commit 0d443dfd16)
According to seastar/doc/lambda-coroutine-fiasco.md, a lambda that
co_awaits once loses its capture frame. In the distributed_loader
code there's at least one of that kind.
fixes: #12175
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12170
(cherry picked from commit 71179ff5ab)
Fix https://github.com/scylladb/scylla-docs/issues/4126
Closes #11122
* github.com:scylladb/scylladb:
doc: add info about the time-consuming step due to resharding
doc: add the new KB to the toctree
doc: doc: add a KB about updating the mode in perftune.yaml after upgrade
(cherry picked from commit e9fec761a2)
Release 5.1 introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE sections should include links to that page and section.
Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.
Related: https://github.com/scylladb/scylladb/pull/9810
Closes #12070
* github.com:scylladb/scylladb:
doc: move the info about per-partition rate limit for the ALTER TABLE statemet from the paragraph to the list
doc: add the links to the per-partition rate limit extention to the CREATE TABLE and ALTER TABLE sections
(cherry picked from commit 6e9f739f19)
This is a backport of https://github.com/scylladb/scylladb/pull/11460.
Closes#12079
* github.com:scylladb/scylladb:
doc: update the commands to upgrade the ScyllaDB image
doc: fix the filename in the index to resolve the warnings and fix the link
doc: apply feedback by adding she step fo load the new repo and fixing the links
doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
doc: update the Enterprise guide to include the Enterprise-onlyimage file
doc: update the image files
doc: split the upgrade-image file to separate files for Open Source and Enterprise
doc: clarify the alternative upgrade procedures for the ScyllaDB image
doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z
This is a backport of https://github.com/scylladb/scylladb/pull/11108.
Closes#12063
* github.com:scylladb/scylladb:
doc: apply feedback about scylla-enterprise-machine-image
doc: update the note about installing scylla-enterprise-machine-image
update the info about installing scylla-enterprise-machine-image during upgrade
doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image
doc: update the info about metrics in 2022.1 compared to 5.0
doc: minor formatting and language fixes
doc: add the new guide to the toctree
doc: add the upgrade guide from 5.0 to 2022.1
PR #11577 added the 5.0->5.1 upgrade guide. At the same time, it
improved some of the common `.rst` files that were used in other
upgrade guides; e.g. the `docs/upgrade/_common/upgrade-guide-v4-rpm.rst`
file is used in the 4.6->5.0 upgrade guide.
The 5.0->5.1 upgrade guide was then refactored. The refactored version
was already backported to the 5.1 branch (#12034). But we should still
backport the improvements done in #11577. This commit contains these
improvements.
(cherry picked from commit 2513497f9a)
Closes#12055
This is a backport of https://github.com/scylladb/scylladb/pull/11461.
Closes#12044
* github.com:scylladb/scylladb:
doc: remove support for Debian 9 from versions 2022.1 and 2022.2
doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2
backport 11461 doc: add support for Debian 11 to versions 2022.1 and 2022.2
We added a UUID device file existence check in #11399. We expect the UUID
device file to be created before the check, and we wait for its creation with
"udevadm settle" after "mkfs.xfs".
However, we are actually getting an error saying the UUID device file is missing,
which probably means "udevadm settle" doesn't guarantee that the device file is created
under some conditions.
To avoid the error, use var-lib-scylla.mount to wait until the UUID device
file is ready, and run the file existence check when the service has
failed.
Fixes #11617
Closes #11666
(cherry picked from commit a938b009ca)
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.
Fixes#11359
(cherry picked from commit 8835a34ab6)
When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.
This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)
When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.
This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.
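The corrected control flow can be sketched in Python. This is a simplified model in which restrictions are plain predicates over a row dictionary. `row_passes` is a hypothetical name and does not reflect the code in selection.cc.

```python
def row_passes(row, multi_column_restrictions, other_restrictions):
    """Return True only if every restriction accepts the row."""
    for restriction in multi_column_restrictions:
        if not restriction(row):
            return False  # return early only on failure
    # Fall through: the remaining restrictions must hold as well.
    return all(restriction(row) for restriction in other_restrictions)


multi = [lambda r: (r["ck1"], r["ck2"]) < (0, 0)]
regular = [lambda r: r["regular_col"] == 0]

row = {"pk": 0, "ck1": -1, "ck2": 0, "regular_col": 1}
print(row_passes(row, multi, regular))  # prints False
```

The buggy version corresponds to returning the result of the multi-column check directly, which would accept this row even though `regular_col = 0` does not hold.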
Fixes: #6200
Fixes: #12014
Closes #12031
* github.com:scylladb/scylladb:
cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
boost/restrictions-test: uncomment part of the test that passes now
cql-pytest: enable test for filtering combined multi column and regular column restrictions
cql3: don't ignore other restrictions when a multi column restriction is present during filtering
(cherry picked from commit 2d2034ea28)
There were 4 different pages for upgrading Scylla 5.0 to 5.1 (and the
same is true for other version pairs, but I digress) for different
environments:
- "ScyllaDB Image for EC2, GCP, and Azure"
- Ubuntu
- Debian
- RHEL/CentOS
The Ubuntu and Debian pages used a common template:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
with different variable substitutions.
The "Image" page used a similar template, with some extra content in the
middle:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-image-opensource.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
The RHEL/CentOS page used a different template:
```
.. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst
```
This was an unmaintainable mess. Most of the content was "the same" for
each of these options. The only content that must actually be different
is the part with package installation instructions (e.g. calls to `yum`
vs `apt-get`). The rest of the content was logically the same - the
differences were mistakes, typos, and updates/fixes to the text that
were made in some of these docs but not others.
In this commit I prepare a single page that covers the upgrade and
rollback procedures for each of these options. The section dependent on
the system was implemented using Sphinx Tabs.
I also fixed and changed some parts:
- In the "Gracefully stop the node" section:
Ubuntu/Debian/Images pages had:
```rst
.. code:: sh

   sudo service scylla-server stop
```
RHEL/CentOS pages had:
```rst
.. code:: sh
.. include:: /rst_include/scylla-commands-stop-index.rst
```
the stop-index file contained this:
```rst
.. tabs::

   .. group-tab:: Supported OS

      .. code-block:: shell

         sudo systemctl stop scylla-server

   .. group-tab:: Docker

      .. code-block:: shell

         docker exec -it some-scylla supervisorctl stop scylla

      (without stopping *some-scylla* container)
```
So the RHEL/CentOS version had two tabs: one for Scylla installed
directly on the system, one for Scylla running in Docker - which is
interesting, because nothing anywhere else in the upgrade documents
mentions Docker. Furthermore, the RHEL/CentOS version used `systemctl`
while the ubuntu/debian/images version used `service` to stop/start
scylla-server. Both work on modern systems.
The Docker option is completely out of place - the rest of the upgrade
procedure does not mention Docker. So I decided it doesn't make sense to
include it. Docker documentation could be added later if we actually
decide to write upgrade documentation when using Docker... Between
`systemctl` and `service` I went with `service` as it's a bit
higher-level.
- Similar change for "Start the node" section, and corresponding
stop/start sections in the Rollback procedure.
- To reuse text for Ubuntu and Debian, when referencing "ScyllaDB deb
repo" in the Debian/Ubuntu tabs, I provide two separate links: to
Debian and Ubuntu repos.
- the link to rollback procedure in the RPM guide (in 'Download and
install the new release' section) pointed to rollback procedure from
3.0 to 3.1 guide... Fixed to point to the current page's rollback
procedure.
- in the rollback procedure steps summary, the RPM version missed the
"Restore system tables" step.
- in the rollback procedure, the repository links were pointing to the
new versions, while they should point to the old versions.
There are some other pre-existing problems I noticed that need fixing:
- EC2/GCP/Azure option has no corresponding coverage in the rollback
section (Download and install the old release) as it has in the
upgrade section. There is no guide for rolling back 3rd party and OS
packages, only Scylla. I left a TODO in a comment.
- the repository links assume certain Debian and Ubuntu versions (Debian
10 and Ubuntu 20), but there are more available options (e.g. Ubuntu
22). Not sure how to deal with this problem. Maybe a separate section
with links? Or just a generic link without choice of platform/version?
Closes #11891
(cherry picked from commit 0c7ff0d2cb)
Backport notes:
Funnily, the 5.1 branch did not have the upgrade guide to 5.1 at all. It
was only in `master`. So the backport does not remove files, only adds
new ones.
I also had to add:
- an additional link in the upgrade-opensource index to the 5.1 upgrade
page (it was already in upstream `master` when the cherry-picked commit
was added)
- the list of new metrics, which was also completely missing in
branch-5.1.
Closes #12034
Ubuntu 22.04 is supported by both ScyllaDB Open Source 5.0 and Enterprise 2022.1.
Closes #11227
* github.com:scylladb/scylladb:
doc: add the redirects from Ubuntu version specific to version generic pages
doc: remove version-specific content for Ubuntu and add the generic page to the toctree
doc: rename the file to include Ubuntu
doc: remove the version number from the document and add the link to Supported Versions
doc: add a generic page for Ubuntu
doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1
(cherry picked from commit d4c986e4fa)
This PR is related to https://github.com/scylladb/scylla-docs/issues/4124 and https://github.com/scylladb/scylla-docs/issues/4123.
**New Enterprise Upgrade Guide from 2021.1 to 2022.2**
I've added the upgrade guide for ScyllaDB Enterprise image. It consists of 3 files:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst
upgrade/_common/upgrade-image.rst
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst
**Modified Enterprise Upgrade Guides 2021.1 to 2022.2**
I've modified the existing guides for Ubuntu and Debian to use the same files as above, but exclude the image-related information:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst + /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst = /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian.rst
To make things simpler and remove duplication, I've replaced the guides for Ubuntu 18 and 20 with a generic Ubuntu guide.
**Modified Enterprise Upgrade Guides from 4.6 to 5.0**
These guides included a bug: they included the image-related information (about updating OS packages), because a file that includes that information was included by mistake. What's worse, it was duplicated. After the includes were removed, image-related information is no longer included in the Ubuntu and Debian guides (this fixes https://github.com/scylladb/scylla-docs/issues/4123).
I've modified the index file to be in sync with the updates.
Closes #11285
* github.com:scylladb/scylladb:
doc: reorganize the content to list the recommended way of upgrading the image first
doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file
doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information
doc: update the guides for Ubuntu and Debian to remove image information and the OS version number
doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1
(cherry picked from commit dca351c2a6)
Fix https://github.com/scylladb/scylladb/issues/11393
- Rename the tool names across the docs.
- Update the examples to replace `scylla-sstable` and `scylla-types` with `scylla sstable` and `scylla types`, respectively.
Closes #11432
* github.com:scylladb/scylladb:
doc: update the tool names in the toctree and reference pages
doc: rename the scylla-types tool as Scylla Types
doc: rename the scylla-sstable tool as Scylla SStable
(cherry picked from commit 2c46c24608)
This is a very important aspect of the tool that was completely missing from the document before. Also add a comparison with SStableDump.
Fixes: https://github.com/scylladb/scylladb/issues/11363
Closes #11390
* github.com:scylladb/scylladb:
docs: scylla-sstable.rst: add comparison with SStableDump
docs: scylla-sstable.rst: add section about providing the schema
(cherry picked from commit 2ab5cbd841)
The purpose of this PR is to update the information about the default SStable format.
It
Closes #11431
* github.com:scylladb/scylladb:
doc: simplify the information about default formats in different versions
doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format
doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page
doc: fix additional information regarding the ME format on the SStable 3.x page
doc: add the ME format to the table
add a comment to remove the information when the documentation is versioned (in 5.1)
doc: replace Scylla with ScyllaDB
doc: fix the formatting and language in the updated section
doc: fix the default SStable format
(cherry picked from commit a0392bc1eb)
This PR introduces the following changes to the documentation landing page:
- The "New to ScyllaDB? Start here!" box is added.
- The "Connect your application to Scylla" box is removed.
- Some wording has been improved.
- "Scylla" has been replaced with "ScyllaDB".
Closes #11896
* github.com:scylladb/scylladb:
Update docs/index.rst
doc: replace Scylla with ScyllaDB on the landing page
doc: improve the wording on the landing page
doc: add the link to the ScyllaDB Basics page to the documentation landing page
(cherry picked from commit 2b572d94f5)
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what the
default setting is, or why a user might want to use this option.
This patch changes the description to (I hope) better address these
issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11404
* github.com:scylladb/scylladb:
doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
doc: cql-extensions.md: improve description of synchronous views
(cherry picked from commit b9fc504fb2)
This PR is V2 of the [PR created by @psarna](https://github.com/scylladb/scylladb/pull/11560).
I have:
- copied the content.
- applied the suggestions left by @nyh.
- made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax.
Fixes https://github.com/scylladb/scylladb/issues/11378
Closes #11984
* github.com:scylladb/scylladb:
doc: label user-defined functions as Experimental
doc: restore the note for the Count function (removed by mistake)
doc: document user defined functions (UDFs)
(cherry picked from commit 7cbb0b98bb)
Fix https://github.com/scylladb/scylladb/issues/11373
- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.
Closes #11417
* github.com:scylladb/scylladb:
doc: add a comment to remove the note in version 5.1
doc: update the information on the Counting all rows page and add the recommendation to upgrade ScyllaDB
doc: add a note to the description of COUNT with a reference to the KB article
doc: add COUNT to the list of acceptable selectors of the SELECT statement
(cherry picked from commit 22bb35e2cb)
compaction_manager::task (and thus compaction_data) can be stopped
for many different reasons. Thus, an abort can be requested more
than once on the compaction_data abort source, causing a crash.
To prevent this, before each request_abort() we check whether an abort
was already requested.
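The guard can be sketched roughly as follows. This is only an illustration, not the actual patch: `abort_source_stub` and `request_abort_once` are hypothetical names, and the stub merely models an abort source that must not be aborted twice.

```cpp
#include <cassert>

// Hypothetical stand-in for an abort source that breaks if aborted twice.
class abort_source_stub {
    bool _requested = false;
public:
    bool abort_requested() const noexcept { return _requested; }
    void request_abort() {
        // Models the crash: a second request violates the precondition.
        assert(!_requested);
        _requested = true;
    }
};

// Requests an abort only if none was requested before.
// Returns true if this call actually issued the request.
inline bool request_abort_once(abort_source_stub& as) {
    if (as.abort_requested()) {
        return false;
    }
    as.request_abort();
    return true;
}
```

With such a guard, every stop path can call `request_abort_once()` without knowing whether another path got there first.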
Closes #12004
(cherry picked from commit 7ead1a7857)
Fixes #12002.
The get_live_token_owners returns the nodes that are part of the ring
and live.
The get_unreachable_token_owners returns the nodes that are part of the ring
and are not alive.
The token_metadata::get_all_endpoints returns nodes that are part of the
ring.
The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and call is_alive to check if
the node is up or down. So that the correctness does not depend on
any derived information.
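As a rough sketch of the idea (not the real implementation), both results can be derived from one authoritative list of ring members plus a liveness check. The `endpoint` alias, the `token_owners` struct and `split_token_owners` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

using endpoint = std::string;  // hypothetical endpoint type for the sketch

struct token_owners {
    std::set<endpoint> live;
    std::set<endpoint> unreachable;
};

// Splits ring members by liveness, starting from the authoritative
// list of ring members rather than from derived state.
inline token_owners split_token_owners(
        const std::set<endpoint>& ring_members,
        const std::function<bool(const endpoint&)>& is_alive) {
    token_owners result;
    for (const auto& ep : ring_members) {
        (is_alive(ep) ? result.live : result.unreachable).insert(ep);
    }
    // Every ring member lands in exactly one of the two sets.
    assert(result.live.size() + result.unreachable.size() == ring_members.size());
    return result;
}
```

Because both sets come from the same input, a node can never be missing from both or present in both.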
This patch fixes a truncate issue in storage_proxy::truncate_blocking
where it calls get_live_token_owners and get_unreachable_token_owners to
decide the nodes to talk with for truncate operation. The truncate
failed because incorrect nodes were returned.
Fixes #10296
Fixes #11928
Closes #11952
(cherry picked from commit 16bd9ec8b1)
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash. This wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentially caused a null dereference.
Tests: 1. The fixed issue scenario from github.
2. Unit tests in release mode.
Fixes #11774
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>
Closes #11777
(cherry picked from commit ab7429b77d)
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of these being
resurrected and the view builder generating view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.
Fixes: #11668
Closes #11671
(cherry picked from commit 5621cdd7f9)
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings of Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.
Fixes #11470
Closes #11693
(cherry picked from commit 636e14cc77)
The EC2 instance metadata service can be busy, so let's retry connecting at
intervals, just like we do in scylla-machine-image.
Fixes #10250
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #11688
(cherry picked from commit 6b246dc119)
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.
This series fixes this bug (it was a bug in our materialized views implementation) and adds a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).
Fixes #11801
Closes #11808
* github.com:scylladb/scylladb:
test/alternator: add test for issue 11801
MV: fix handling of view update which reassign the same key value
materialized views: inline used-once and confusing function, replace_entry()
(cherry picked from commit e981bd4f21)
While being stopped, the compaction manager may hit ENOSPC. This is not a
reason to fail the stopping process with an abort; it is better to log a
warning and proceed as if nothing happened.
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit a9805106 (table: seal_active_memtable: handle ENOSPC error)
made memtable flushing code stand ENOSPC and continue flushing again
in the hope that the node administrator would provide some free space.
However, it looks like the IO code may report back ENOSPC with some
exception type this code doesn't expect. This patch tries to fix it
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing loop is very branchy in its attempts to find out whether or
not to abort. The "allowed_retries" count can be a good indicator of the
decision taken. This makes the code notably shorter and easier to extend.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.
Fixes #11245
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11246
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
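A minimal sketch of such an up-front check is shown below, assuming the 6.71e-5 limit quoted above. The names `min_bloom_filter_fp_chance`, `is_supported_fp_chance` and `validate_fp_chance` are hypothetical and do not come from the patch; only the lower bound is checked here.

```cpp
#include <cassert>
#include <stdexcept>

// Minimal supported false-positive rate, as quoted in the text above.
constexpr double min_bloom_filter_fp_chance = 6.71e-5;

// Returns true if the requested value is not below the supported minimum.
// The comparison is false for NaN, so NaN is rejected as well.
inline bool is_supported_fp_chance(double fp_chance) noexcept {
    return fp_chance >= min_bloom_filter_fp_chance;
}

// Hypothetical schema-time check: reject the table definition up front
// instead of failing much later, while flushing a memtable.
inline void validate_fp_chance(double fp_chance) {
    if (!is_supported_fp_chance(fp_chance)) {
        throw std::invalid_argument(
            "bloom_filter_fp_chance is below the supported minimum");
    }
}
```

The point of the design is where the exception is thrown: at table creation the error can be returned to the client, while at flush time there is no one to report it to.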
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes #11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11576
(cherry picked from commit 4c93a694b7)
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.
The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.
So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.
Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.
Fixes #11222
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11298
(cherry picked from commit 941c719a23)
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:
    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subsequent stop
will queue its stopping lambdas after barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.
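The lifetime aspect of the fix can be illustrated with plain standard C++, without Seastar. In this sketch, `shared_barrier_state`, `task_queue` and `queue_wakeup` are hypothetical stand-ins: a queued task that captures a `std::shared_ptr` copy keeps the state alive even if the owner releases it before the task runs.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the state shared by all shards' barriers.
struct shared_barrier_state {
    int wakeups = 0;
};

using task_queue = std::vector<std::function<void()>>;

// Queues a wake-up task. The lambda captures a shared_ptr copy, so the
// state stays alive until the task has run and been destroyed, even if
// the owner drops its own reference first.
inline void queue_wakeup(task_queue& queue,
                         std::shared_ptr<shared_barrier_state> state) {
    assert(state != nullptr);
    queue.push_back([state = std::move(state)] { ++state->wakeups; });
}
```

Capturing a raw pointer instead would make the task's validity depend on the order in which the queue is drained, which is exactly what debug-mode shuffling breaks.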
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11553
The generator was first setting the marker then applied tombstones.
The marker was set like this:
    row.marker() = random_row_marker();
Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.
However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.
This could generate rows with markers uncompacted with shadowable tombstones.
This broke row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.
Fix by making sure there are no key clashes.
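One simple way to rule out clashes is to draw the keys into a set before generating any rows. The sketch below is an assumption about how this could be done, not a copy of the generator's code; `pick_distinct_keys` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>

// Draws `count` distinct keys from [0, key_range). Each key can then be
// used for exactly one generated row, so a later row never overwrites
// the marker of an earlier one.
inline std::set<int> pick_distinct_keys(std::mt19937& rng,
                                        std::size_t count,
                                        int key_range) {
    assert(key_range > 0 && count <= static_cast<std::size_t>(key_range));
    std::uniform_int_distribution<int> dist(0, key_range - 1);
    std::set<int> keys;
    while (keys.size() < count) {
        keys.insert(dist(rng));  // duplicates are ignored by std::set
    }
    return keys;
}
```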
Closes #11663
(cherry picked from commit 5268f0f837)
If user stops off-strategy via API, compaction manager can decide
to give up on it completely, so data will sit unreshaped in
maintenance set, preventing it from being compacted with data
in the main set. That's problematic because it will probably lead
to a significant increase in read and space amplification until
off-strategy is triggered again, which cannot happen anytime
soon.
Let's handle it by moving data in maintenance set into main one,
even if unreshaped. Then regular compaction will be able to
continue from where off-strategy left off.
Fixes #11543.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11545
(cherry picked from commit a04047f390)
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
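The growth strategy can be sketched with a plain std::vector (the patch itself switches to chunked_vector, which is not shown here). `reserve_for_one_more` and `count_reallocations` are hypothetical helpers written for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Makes sure one more element fits, so that the following push_back
// cannot fail on allocation. Capacity grows geometrically instead of
// to exactly size() + 1, which keeps the total cost linear.
template <typename T>
void reserve_for_one_more(std::vector<T>& v) {
    if (v.size() < v.capacity()) {
        return;  // already room for one more entry
    }
    const std::size_t wanted = v.capacity() == 0 ? 1 : v.capacity() * 2;
    v.reserve(wanted);
    assert(v.capacity() > v.size());
}

// Counts how many times the capacity changes while appending n entries.
inline std::size_t count_reallocations(std::size_t n) {
    std::vector<int> log;
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = log.capacity();
        reserve_for_one_more(log);
        if (log.capacity() != before) {
            ++reallocations;
        }
        log.push_back(static_cast<int>(i));
    }
    return reallocations;
}
```

With doubling, the number of reallocations grows logarithmically with the number of entries, whereas reserving exactly one extra slot each time can reallocate (and copy everything) on every entry.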
Fixes #11211
Closes #11215
(cherry picked from commit 7f80602b01)
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421
Closes #11422
(cherry picked from commit be9d1c4df4)
Per-partition rate limiting added a new error type which should be
returned when Scylla decides to reject an operation due to per-partition
rate limit being exceeded. The new error code requires drivers to
negotiate support for it, otherwise Scylla will report the error as
`Config_error`. The existing error code override logic works properly,
however due to a mistake Scylla will report the `Config_error` code even
if the driver correctly negotiated support for it.
This commit fixes the problem by specifying the correct error code in
`rate_limit_exception`'s constructor.
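The intended fallback behaviour can be summarized in a few lines. This is a sketch under assumptions: the `error_code` enum and `code_for_driver` are hypothetical and list only the two codes discussed here, without their real protocol values.

```cpp
#include <cassert>

// Hypothetical subset of error codes used only for this sketch.
enum class error_code {
    config_error,
    rate_limit_error,
};

// Picks the code to report to a driver: the new rate-limit code only if
// the driver negotiated support for it, otherwise the Config_error fallback.
inline error_code code_for_driver(error_code actual,
                                  bool driver_supports_rate_limit_error) noexcept {
    if (actual == error_code::rate_limit_error && !driver_supports_rate_limit_error) {
        return error_code::config_error;
    }
    return actual;
}
```

The bug described above corresponds to `actual` already being the fallback code when the exception is constructed, so the override logic had nothing to preserve.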
Tested manually with a modified version of the Rust driver which
negotiates support for the new error. Additionally, tested what happens
when the driver doesn't negotiate support (Scylla properly falls back to
`Config_error`).
Branches: 5.1
Fixes: #11517
Closes #11518
(cherry picked from commit e69b44a60f)
Commit 8ab57aa added a yield to the buffer-copy loop, which means that
the copy can yield before done and the multishard reader might see the
half-copied buffer and consider the reader done (because
`_end_of_stream` is already set) resulting in dropping the remaining
part of the buffer and in an invalid stream if the last copied fragment
wasn't a partition-end.
Fixes: #11561
(cherry picked from commit 0c450c9d4c)
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.
fixes: #11465
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2c74062962)
from Tomasz Grabiec
This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.
No known production impact.
Refs https://github.com/scylladb/scylladb/issues/11307
Closes #11312
* github.com:scylladb/scylladb:
test: mutation_test: Add explicit test for mutation commutativity
test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
db: mutation_partition: Drop unnecessary maybe_shadow()
db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
mutation_partition: row: make row marker shadowing symmetric
(cherry picked from commit 484004e766)
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.
Closes #11260
(cherry picked from commit 8ee5b69f80)
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes #11508
Closes #11536
(cherry picked from commit 9b6fc553b4)
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes #11202
(cherry picked from commit cdb3e71045)
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes #11549
Closes #11555
(cherry picked from commit d3b8c0c8a6)
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.
Fixes #11476
(cherry picked from commit 1c2eef384d)
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264
Closes #11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
(cherry picked from commit afa7960926)
Scenario:
    cache = [
        row(pos=2, continuous=false),
        row(pos=after(2), dummy=true)
    ]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
    cache = [
        row(pos=after(2), dummy=true)
    ]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes #11239
Closes #11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so compacting everything on behalf of {}.{}",level,overlapping_sstables,schema->ks_name(),schema->cf_name());
// Unfortunately no good limit to limit input size to max_sstables for LCS major
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so the level will be entirely compacted on behalf of {}.{}",level,overlapping_sstables,schema->ks_name(),schema->cf_name());
,enable_sstables_mc_format(this,"enable_sstables_mc_format",value_status::Unused,true,"Enable SSTables 'mc' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
,enable_sstables_md_format(this,"enable_sstables_md_format",value_status::Unused,true,"Enable SSTables 'md' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
,enable_dangerous_direct_import_of_cassandra_counters(this,"enable_dangerous_direct_import_of_cassandra_counters",value_status::Used,false,"Only turn this option on if you want to import tables from Cassandra containing counters, and you are SURE that no counters in that table were created in a version earlier than Cassandra 2.1."
" It is not enough to have ever since upgraded to newer versions of Cassandra. If you EVER used a version earlier than 2.1 in the cluster where these SSTables come from, DO NOT TURN ON THIS OPTION! You will corrupt your data. You have been warned.")
,enable_shard_aware_drivers(this,"enable_shard_aware_drivers",value_status::Used,true,"Enable native transport drivers to use connection-per-shard for better performance")
"Use separate schema commit log unconditionally rater than after restart following discovery of cluster-wide support for it.")
,nodeops_watchdog_timeout_seconds(this,"nodeops_watchdog_timeout_seconds",liveness::LiveUpdate,value_status::Used,120,"Time in seconds after which node operations abort when not hearing from the coordinator")
,nodeops_heartbeat_interval_seconds(this,"nodeops_heartbeat_interval_seconds",liveness::LiveUpdate,value_status::Used,10,"Period of heartbeat ticks in node operations")
"Keep SSTable index pages in the global cache after a SSTable read. Expected to improve performance for workloads with big partitions, but may degrade performance for workloads with small partitions.")
coredump_setup = interactive_ask_service('Do you want to enable coredumps?', 'Yes - sets up coredump to allow a post-mortem analysis of the Scylla state just prior to a crash. No - skips this step.', coredump_setup)
@@ -4,70 +4,65 @@ Raft Consensus Algorithm in ScyllaDB
Introduction
--------------
ScyllaDB was originally designed, following Apache Cassandra, to use gossip for topology and schema updates and the Paxos consensus algorithm for
strong data consistency (:doc:`LWT </using-scylla/lwt>`). To achieve stronger consistency without performance penalty, ScyllaDB 5.0 is turning to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.
ScyllaDB was originally designed, following Apache Cassandra, to use gossip for topology and schema updates and the Paxos consensus algorithm for
strong data consistency (:doc:`LWT </using-scylla/lwt>`). To achieve stronger consistency without performance penalty, ScyllaDB 5.x has turned to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.
Raft is a consensus algorithm that implements a distributed, consistent, replicated log across members (nodes). Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
Raft uses a heartbeat mechanism to trigger a leader election. All servers start as followers and remain in the follower state as long as they receive valid RPCs (heartbeat) from a leader or candidate. A leader sends periodic heartbeats to all followers to maintain his authority (leadership). Suppose a follower receives no communication over a period called the election timeout. In that case, it assumes no viable leader and begins an election to choose a new leader.
Leader selection is described in detail in the `raft paper <https://raft.github.io/raft.pdf>`_.
Leader selection is described in detail in the `Raft paper <https://raft.github.io/raft.pdf>`_.
Scylla 5.0 uses Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes **in the same order**, even in the face of a node or network failures.
ScyllaDB 5.x may use Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes **in the same order**, even in the face of a node or network failures.
Following Scylla 5.x releases will use Raft to guarantee consistent topology updates similarly.
Following ScyllaDB 5.x releases will use Raft to guarantee consistent topology updates similarly.
.. _raft-quorum-requirement:
Quorum Requirement
-------------------
Raft requires at least a quorum of nodes in a cluster to be available. If multiple nodes fail
and the quorum is lost, the cluster is unavailable for schema updates. See :ref:`Handling Failures <raft-handliing-failures>`
Raft requires at least a quorum of nodes in a cluster to be available. If multiple nodes fail
and the quorum is lost, the cluster is unavailable for schema updates. See :ref:`Handling Failures <raft-handling-failures>`
Note that when you have a two-DC cluster with the same number of nodes in each DC, the cluster will lose the quorum if one
Note that when you have a two-DC cluster with the same number of nodes in each DC, the cluster will lose the quorum if one
of the DCs is down.
**We recommend configuring three DCs per cluster to ensure that the cluster remains available and operational when one DC is down.**
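The arithmetic behind this recommendation can be checked with a short sketch. It assumes that a quorum is a strict majority of all nodes in the cluster; the function names and the three-nodes-per-DC layouts are hypothetical and chosen only for illustration.

```python
def quorum_size(total_nodes: int) -> int:
    # A strict majority of the cluster.
    return total_nodes // 2 + 1


def has_quorum(nodes_per_dc: dict[str, int], down_dcs: set[str]) -> bool:
    # Count the nodes in the DCs that are still up and compare with the majority.
    total = sum(nodes_per_dc.values())
    alive = sum(n for dc, n in nodes_per_dc.items() if dc not in down_dcs)
    return alive >= quorum_size(total)


two_dcs = {"dc1": 3, "dc2": 3}
three_dcs = {"dc1": 3, "dc2": 3, "dc3": 3}

print(has_quorum(two_dcs, {"dc2"}))    # 3 of 6 nodes alive, quorum is 4
print(has_quorum(three_dcs, {"dc3"}))  # 6 of 9 nodes alive, quorum is 5
```

With two equally sized DCs, losing either one leaves exactly half of the nodes, which is one short of a majority. With three DCs, the two remaining DCs still form a majority.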
Enabling Raft
---------------
Enabling Raft in ScyllaDB 5.0
===============================
Enabling Raft in ScyllaDB 5.0 and 5.1
=====================================
..note::
In ScyllaDB 5.0:
..warning::
In ScyllaDB 5.0 and 5.1, Raft is an experimental feature.
* Raft is an experimental feature.
* Raft implementation only covers safe schema changes. See :ref:`Safe Schema Changes with Raft <raft-schema-changes>`.
It is not possible to enable Raft in an existing cluster in ScyllaDB 5.0 and 5.1.
In order to have a Raft-enabled cluster in these versions, you must create a new cluster with Raft enabled from the start.
If you are creating a new cluster, add ``raft`` to the list of experimental features in your ``scylla.yaml`` file:
..warning::
..code-block::yaml
experimental_features:
- raft
**Do not** use Raft in production clusters in ScyllaDB 5.0 and 5.1. Such clusters won't be able to correctly upgrade to ScyllaDB 5.2.
If you upgrade to ScyllaDB 5.0 from an earlier version, perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>`
updating the ``scylla.yaml`` file for **each node** in the cluster to enable the experimental Raft feature:
.. code-block:: yaml

   experimental_features:
       - raft
When all the nodes in the cluster are updated and restarted, the cluster will begin to use Raft for schema changes.
Use Raft only for testing and experimentation in clusters which can be thrown away.
..warning::
Once enabled, Raft cannot be disabled on your cluster. The cluster nodes will fail to restart if you remove the Raft feature.
Verifying that Raft Is Enabled
When creating a new cluster, add ``raft`` to the list of experimental features in your ``scylla.yaml`` file:
.. code-block:: yaml

   experimental_features:
       - raft
Verifying that Raft is enabled
===============================
You can verify that Raft is enabled on your cluster in one of the following ways:
@@ -100,23 +95,23 @@ Safe Schema Changes with Raft
-------------------------------
In ScyllaDB, schema is based on :doc:`Data Definition Language (DDL) </cql/ddl>`. In earlier ScyllaDB versions, schema changes were tracked via the gossip protocol, which might lead to schema conflicts if the updates are happening concurrently.
Implementing Raft eliminates schema conflicts and allows full automation of DDL changes under any conditions, as long as a quorum
Implementing Raft eliminates schema conflicts and allows full automation of DDL changes under any conditions, as long as a quorum
of nodes in the cluster is available. The following examples illustrate how Raft provides the solution to problems with schema changes.
* A network partition may lead to a split-brain case, where each subset of nodes has a different version of the schema.
With Raft, after a network split, the majority of the cluster can continue performing schema changes, while the minority needs to wait until it can rejoin the majority. Data manipulation statements on the minority can continue unaffected, provided the :ref:`quorum requirement <raft-quorum-requirement>` is satisfied.
* Two or more conflicting schema updates are happening at the same time. For example, two different columns with the same definition are simultaneously added to the cluster. There is no effective way to resolve the conflict - the cluster will employ the schema with the most recent timestamp, but changes related to the shadowed table will be lost.
* Two or more conflicting schema updates are happening at the same time. For example, two different columns with the same definition are simultaneously added to the cluster. There is no effective way to resolve the conflict - the cluster will employ the schema with the most recent timestamp, but changes related to the shadowed table will be lost.
With Raft, concurrent schema changes are safe.
With Raft, concurrent schema changes are safe.
In summary, Raft makes schema changes safe, but it requires that a quorum of nodes in the cluster is available.
.._raft-handliing-failures:
.._raft-handling-failures:
Handling Failures
------------------
@@ -175,7 +170,7 @@ Examples
* - 1-4 nodes
- Schema updates are possible and safe.
- Try restarting the nodes. If the nodes are dead, :doc:`replace them with new nodes </operating-scylla/procedures/cluster-management/replace-dead-node-or-more/>`.
* - 1 DC
* - 1 DC
- Schema updates are possible and safe.
- When the DC comes back online, try restarting the nodes in the cluster. If the nodes are dead, :doc:`add 3 new nodes in a new region </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
:term:`Sorted Strings Table (SSTable)<SSTable>` is the persistent file format used by Scylla and Apache Cassandra. SSTable is saved as a persistent, ordered, immutable set of files on disk.
:term:`Sorted Strings Table (SSTable)<SSTable>` is the persistent file format used by ScyllaDB and Apache Cassandra. SSTable is saved as a persistent, ordered, immutable set of files on disk.
Immutable means SSTables are never modified; they are created by a MemTable flush and are deleted by a compaction.
The location of Scylla SSTables is specified in scylla.yaml ``data_file_directories`` parameter (default location: ``/var/lib/scylla/data``).
The location of ScyllaDB SSTables is specified in scylla.yaml ``data_file_directories`` parameter (default location: ``/var/lib/scylla/data``).
SSTable 3.0 (mc format) is more efficient and requires less disk space than the SSTable 2.x. SSTable version support is as follows:
SSTable 3.x is more efficient and requires less disk space than the SSTable 2.x.
* In Scylla 3.1 and above, mc format is enabled by default.
* In ScyllaDB 5.1 and above, the ``me`` format is enabled by default.
* In ScyllaDB 4.3 to 5.0, the ``md`` format is enabled by default.
* In ScyllaDB 3.1 to 4.2, the ``mc`` format is enabled by default.
* In ScyllaDB 3.0, the ``mc`` format is disabled by default. You can enable it by adding the ``enable_sstables_mc_format`` parameter set to ``true`` in the ``scylla.yaml`` file. For example:
.. code-block:: shell
enable_sstables_mc_format: true
* In Scylla 3.0, mc format is disabled by default and can be enabled by adding the ``enable_sstables_mc_format`` parameter as 'true' in ``scylla.yaml`` file.
.. REMOVE IN FUTURE VERSIONS - Remove the note above in version 5.2.
For example:
Additional Information
-------------------------
..code-block::shell
enable_sstables_mc_format: true
For more information on Scylla 3.x SSTable formats, see below:
For more information on ScyllaDB 3.x SSTable formats, see below:
*:doc:`SSTable 3.0 Data File Format <sstables-3-data-file-format>`
@@ -28,8 +28,13 @@ Table of contents mc-1-big-TOC.txt
This document focuses on the data file format but also refers to other components in parts where information stored in them affects the way we read/write the data file.
Note that the file on-disk format applies both to the "mc" and "md" SSTable format versions.
The "md" format only fixed the semantics of the (min|max)_clustering_key fields in the SSTable Statistics file, which are now valid for describing the accurate range of clustering prefixes present in the SSTable.
Note that the file on-disk format applies to all "m*" SSTable format versions ("mc", "md", and "me").
* The "md" format only fixed the semantics of the ``(min|max)_clustering_key`` fields in the SSTable Statistics file,
which are now valid for describing the accurate range of clustering prefixes present in the SSTable.
* The "me" format added the ``host_id`` of the host writing the SSTable to the SSTable Statistics file.
It is used to qualify the commit log replay position that is also stored in the SSTable Statistics file.
See :doc:`SSTables 3.0 Statistics File Format </architecture/sstable/sstable3/sstables-3-statistics>` for more details.
This section describes the statements supported by CQL to insert, update, delete, and query data.
:ref:`SELECT <select-statement>`
@@ -99,11 +97,12 @@ alternatively, of the wildcard character (``*``) to select all the columns defin
Selectors
`````````
A :token:`selector` can be one of:
A :token:`selector` can be one of the following:
- A column name of the table selected to retrieve the values for that column.
- A casting, which allows you to convert a nested selector to a (compatible) type.
- A function call, where the arguments are selector themselves.
- A call to the :ref:`COUNT function <count-function>`, which counts all non-null results.
Aliases
```````
@@ -606,7 +605,7 @@ of eventual consistency on an event of a timestamp collision:
``INSERT`` statements happening concurrently at different cluster
nodes proceed without coordination. Eventually cell values
supplied by a statement with the highest timestamp will prevail.
supplied by a statement with the highest timestamp will prevail (see :ref:`update ordering <update-ordering>`).
Unless a timestamp is provided by the client, Scylla will automatically
generate a timestamp with microsecond precision for each
@@ -615,7 +614,7 @@ by the same node are unique. Timestamps assigned at different
nodes are not guaranteed to be globally unique.
With a steadily high write rate timestamp collision
is not unlikely. If it happens, i.e. two ``INSERTS`` have the same
timestamp, the lexicographically bigger value prevails:
timestamp, a conflict resolution algorithm determines which of the inserted cells prevails (see :ref:`update ordering <update-ordering>`).
Please refer to the :ref:`UPDATE <update-parameters>` section for more information on the :token:`update_parameter`.
@@ -723,8 +722,8 @@ Similarly to ``INSERT``, ``UPDATE`` statement happening concurrently at differen
cluster nodes proceed without coordination. Cell values
supplied by a statement with the highest timestamp will prevail.
If two ``UPDATE`` statements or ``UPDATE`` and ``INSERT``
statements have the same timestamp,
lexicographically bigger value prevails.
statements have the same timestamp, a conflict resolution algorithm determines which cell prevails
(see :ref:`update ordering <update-ordering>`).
Regarding the :token:`assignment`:
@@ -765,7 +764,7 @@ parameters:
Scylla ensures that query timestamps created by the same coordinator node are unique (even across different shards
on the same node). However, timestamps assigned at different nodes are not guaranteed to be globally unique.
Note that with a steadily high write rate, timestamp collision is not unlikely. If it happens, e.g. two INSERTS
have the same timestamp, conflicting cell values are compared and the cells with the lexicographically bigger value prevail.
have the same timestamp, a conflict resolution algorithm determines which of the inserted cells prevails (see :ref:`update ordering <update-ordering>` for more information):
-``TTL``: specifies an optional Time To Live (in seconds) for the inserted values. If set, the inserted values are
automatically removed from the database after the specified time. Note that the TTL concerns the inserted values, not
the columns themselves. This means that any subsequent update of the column will also reset the TTL (to whatever TTL
@@ -775,6 +774,55 @@ parameters:
-``TIMEOUT``: specifies a timeout duration for the specific request.
Please refer to the :ref:`SELECT <using-timeout>` section for more information.
.._update-ordering:
Update ordering
~~~~~~~~~~~~~~~
:ref:`INSERT <insert-statement>`, :ref:`UPDATE <update-statement>`, and :ref:`DELETE <delete_statement>`
operations are ordered by their ``TIMESTAMP``.
Ordering of such changes is done at the cell level, where each cell carries a write ``TIMESTAMP``,
other attributes related to its expiration when it has a non-zero time-to-live (``TTL``),
and the cell value.
The fundamental rule for ordering cells that insert, update, or delete data in a given row and column
is that the cell with the highest timestamp wins.
However, it is possible that multiple such cells will carry the same ``TIMESTAMP``.
There could be several reasons for ``TIMESTAMP`` collision:
* Benign collision can be caused by "replay" of a mutation, e.g., due to client retry, or due to internal processes.
In such cases, the cells are equivalent, and any of them can be selected arbitrarily.
* ``TIMESTAMP`` collisions are typically caused by parallel queries that are served
by different coordinator nodes. The coordinators might calculate the same write ``TIMESTAMP``
based on their local time in microseconds.
* Collisions might also happen with user-provided timestamps if the application does not guarantee
unique timestamps with the ``USING TIMESTAMP`` parameter (see :ref:`Update parameters <update-parameters>` for more information).
As said above, in the replay case, ordering of cells should not matter, as they carry the same value
and same expiration attributes, so picking any of them will reach the same result.
However, other ``TIMESTAMP`` conflicts must be resolved in a consistent way by all nodes.
Otherwise, if nodes picked an arbitrary cell in case of a conflict and reached different results,
reading from different replicas would detect the inconsistency and trigger
a read repair that generates yet another cell, which would still conflict with the existing cells,
with no guarantee of convergence.
Therefore, Scylla implements an internal, consistent conflict-resolution algorithm
that orders cells with conflicting ``TIMESTAMP`` values based on other properties, like:
* whether the cell is a tombstone or a live cell,
* whether the cell has an expiration time,
* the cell ``TTL``,
* and finally, what value the cell carries.
The conflict-resolution algorithm is documented in `Scylla's internal documentation <https://github.com/scylladb/scylladb/blob/master/docs/dev/timestamp-conflict-resolution.md>`_
and it may be subject to change.
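The key property of such an algorithm is that it defines a total order over cells, so every replica picks the same winner regardless of the order in which it sees the cells. The sketch below illustrates that property only. The ``Cell`` class and the tie-break order after the timestamp are assumptions made for the example; the actual precedence rules are the ones in the internal documentation linked above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    timestamp: int                # write TIMESTAMP in microseconds
    value: bytes = b""            # unused for tombstones
    is_tombstone: bool = False
    expiry: Optional[int] = None  # expiration time, set only when the cell has a TTL


def precedence(cell: Cell) -> tuple:
    # Illustrative order only (an assumption, not ScyllaDB's exact rules):
    # highest timestamp first, then tombstone over live cell, then expiring
    # over non-expiring, then later expiry, then the larger value.
    return (
        cell.timestamp,
        cell.is_tombstone,
        cell.expiry is not None,
        cell.expiry or 0,
        cell.value,
    )


def reconcile(a: Cell, b: Cell) -> Cell:
    # Picking the maximum under a total order is commutative, so
    # reconcile(a, b) and reconcile(b, a) always agree.
    return max(a, b, key=precedence)


live = Cell(timestamp=10, value=b"a")
dead = Cell(timestamp=10, is_tombstone=True)
print(reconcile(live, dead) == reconcile(dead, live))
```

Because the comparison never falls back to arrival order, a read repair that merges the same cells on another replica produces the same result instead of a new conflicting cell.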
Reliable serialization can be achieved by using unique write ``TIMESTAMP`` values
and by using :doc:`Lightweight Transactions (LWT) </using-scylla/lwt>` to ensure atomicity of
:ref:`INSERT <insert-statement>`, :ref:`UPDATE <update-statement>`, and :ref:`DELETE <delete_statement>`.
.._delete_statement:
DELETE
@@ -814,7 +862,7 @@ For more information on the :token:`update_parameter` refer to the :ref:`UPDATE
In a ``DELETE`` statement, all deletions within the same partition key are applied atomically,
meaning either all columns mentioned in the statement are deleted or none.
If ``DELETE`` statement has the same timestamp as ``INSERT`` or
``UPDATE`` of the same primary key, delete operation prevails.
``UPDATE`` of the same primary key, delete operation prevails (see :ref:`update ordering <update-ordering>`).
A ``DELETE`` operation can be conditional through the use of an ``IF`` clause, similar to ``UPDATE`` and ``INSERT``
statements. Each such ``DELETE`` gets a globally unique timestamp.
.. Need some intro for UDF and native functions in general and point those to it.
.._udfs:
.._native-functions:
Functions
@@ -33,13 +32,15 @@ CQL supports two main categories of functions:
- The :ref:`aggregate functions <aggregate-functions>`, which are used to aggregate multiple rows of results from a
``SELECT`` statement.
.. In both cases, CQL provides a number of native "hard-coded" functions as well as the ability to create new user-defined
.. functions.
In both cases, CQL provides a number of native "hard-coded" functions as well as the ability to create new user-defined
functions.
.. .. note:: By default, the use of user-defined functions is disabled by default for security concerns (even when
.. enabled, the execution of user-definedfunctions is sandboxed and a "rogue" function should not be allowed to do
.. evil, but no sandbox is perfect so using user-defined functions is opt-in). See the ``enable_user_defined_functions``
.. in ``scylla.yaml`` to enable them.
..note:: Although user-defined functions are sandboxed, protecting the system from a "rogue" function, user-defined functions are disabled by default for extra security.
See the ``enable_user_defined_functions`` option in ``scylla.yaml`` to enable them.
Additionally, user-defined functions are still experimental and need to be explicitly enabled by adding ``udf`` to the list of
``experimental_features`` configuration options in ``scylla.yaml``, or turning on the ``experimental`` flag.
See :ref:`Enabling Experimental Features <yaml_enabling_experimental_features>` for details.
.. A function is identifier by its name:
@@ -60,11 +61,11 @@ Native functions
Cast
````
Supported starting from Scylla version 2.1
Supported starting from ScyllaDB version 2.1
The ``cast`` function can be used to convert one native datatype to another.
The following table describes the conversions supported by the ``cast`` function. Scylla will silently ignore any cast converting a cast datatype into its own datatype.
The following table describes the conversions supported by the ``cast`` function. ScyllaDB will silently ignore any cast converting a cast datatype into its own datatype.
User-defined functions (UDFs) execute user-provided code in ScyllaDB. Supported languages are currently Lua and WebAssembly.
UDFs are part of the ScyllaDB schema and are automatically propagated to all nodes in the cluster.
UDFs can be overloaded, so that multiple UDFs with different argument types can have the same function name, for example::
CREATE FUNCTION sample ( arg int ) ...;
CREATE FUNCTION sample ( arg text ) ...;
When calling a user-defined function, arguments can be literals or terms. Prepared statement placeholders can be used, too.
CREATE FUNCTION statement
`````````````````````````
Creating a new user-defined function uses the ``CREATE FUNCTION`` statement. For example::
CREATE OR REPLACE FUNCTION div(dividend double, divisor double)
RETURNS NULL ON NULL INPUT
RETURNS double
LANGUAGE LUA
AS 'return dividend/divisor;';
``CREATE FUNCTION`` with the optional ``OR REPLACE`` keywords creates either a function
or replaces an existing one with the same signature. A ``CREATE FUNCTION`` without ``OR REPLACE``
fails if a function with the same signature already exists. If the optional ``IF NOT EXISTS``
keywords are used, the function will be created only if another function with the same
signature does not exist. ``OR REPLACE`` and ``IF NOT EXISTS`` cannot be used together.
Behavior for null input values must be defined for each function:
* ``RETURNS NULL ON NULL INPUT`` declares that the function will always return null (without being executed) if any of the input arguments is null.
* ``CALLED ON NULL INPUT`` declares that the function will always be executed.
Function Signature
``````````````````
Signatures are used to distinguish individual functions. The signature consists of a fully-qualified function name in the form ``<keyspace>.<function_name>`` and a concatenated list of all the argument types.
Note that keyspace names, function names and argument types are subject to the default naming conventions and case-sensitivity rules.
Functions belong to a keyspace; if no keyspace is specified, the current keyspace is used. User-defined functions are not allowed in the system keyspaces.
DROP FUNCTION statement
```````````````````````
Dropping a function uses the ``DROP FUNCTION`` statement. For example::
DROP FUNCTION myfunction;
DROP FUNCTION mykeyspace.afunction;
DROP FUNCTION afunction ( int );
DROP FUNCTION afunction ( text );
You must specify the argument types of the function, the arguments_signature, in the drop command if there are multiple overloaded functions with the same name but different signatures.
``DROP FUNCTION`` with the optional ``IF EXISTS`` keywords drops a function if it exists, but does not throw an error if it doesn’t.
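The rules above for signatures, ``OR REPLACE``, ``IF NOT EXISTS``, and ``IF EXISTS`` can be summarized in a small model. This is a sketch of the documented behavior only: ``FunctionRegistry`` and its methods are hypothetical names, not part of ScyllaDB or any driver.

```python
class FunctionRegistry:
    """Hypothetical in-memory model of UDF signatures; not ScyllaDB code."""

    def __init__(self) -> None:
        # Signature (keyspace, function name, argument types) -> function body.
        self._functions: dict[tuple, str] = {}

    def create(self, keyspace, name, arg_types, body,
               or_replace=False, if_not_exists=False):
        if or_replace and if_not_exists:
            raise ValueError("OR REPLACE and IF NOT EXISTS cannot be used together")
        key = (keyspace, name, tuple(arg_types))
        if key in self._functions:
            if if_not_exists:
                return  # keep the existing function untouched
            if not or_replace:
                raise ValueError("a function with the same signature already exists")
        self._functions[key] = body

    def drop(self, keyspace, name, arg_types=None, if_exists=False):
        matches = [k for k in self._functions if k[:2] == (keyspace, name)]
        if arg_types is not None:
            matches = [k for k in matches if k[2] == tuple(arg_types)]
        if not matches:
            if if_exists:
                return  # IF EXISTS: silently do nothing
            raise KeyError("no such function")
        if len(matches) > 1:
            # Overloaded name: the argument types are needed to pick one.
            raise ValueError("overloaded function: specify the argument types")
        del self._functions[matches[0]]


registry = FunctionRegistry()
registry.create("ks", "sample", ["int"], "...")
registry.create("ks", "sample", ["text"], "...")  # overload with the same name
registry.drop("ks", "sample", ["int"])            # argument types disambiguate
```

Keying the registry on the full signature is what allows two functions named ``sample`` to coexist and is also why dropping an overloaded name without argument types is ambiguous.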
.._aggregate-functions:
Aggregate functions
@@ -261,6 +321,10 @@ It also can be used to count the non-null value of a given column::
SELECT COUNT (scores) FROM plays;
..note::
Counting all rows in a table may be time-consuming and exceed the default timeout. In such a case,
see :doc:`Counting all rows in a table is slow </kb/count-all-rows>` for instructions.
User-defined aggregates allow the creation of custom aggregate functions. User-defined aggregates can be used in a ``SELECT`` statement.
Each aggregate requires an initial state of type ``STYPE`` defined with the ``INITCOND`` value (default value: ``null``). The first argument of the state function must have type STYPE. The remaining arguments of the state function must match the types of the user-defined aggregate arguments. The state function is called once for each row, and the value returned by the state function becomes the new state. After all rows are processed, the optional FINALFUNC is executed with the last state value as its argument.
The ``STYPE`` value is mandatory in order to distinguish possibly overloaded versions of the state and/or final function, since the overload can appear after creation of the aggregate.
A complete working example for user-defined aggregates (assuming that a keyspace has been selected using the ``USE`` statement)::
CREATE FUNCTION accumulate_len(acc tuple<bigint,bigint>, a text)
RETURNS NULL ON NULL INPUT
RETURNS tuple<bigint,bigint>
LANGUAGE lua as 'return {acc[1] + 1, acc[2] + #a}';
CREATE OR REPLACE FUNCTION present(res tuple<bigint,bigint>)
RETURNS NULL ON NULL INPUT
RETURNS text
LANGUAGE lua as
'return "The average string length is " .. res[2]/res[1] .. "!"';
CREATE OR REPLACE AGGREGATE avg_length(text)
SFUNC accumulate_len
STYPE tuple<bigint,bigint>
FINALFUNC present
INITCOND (0,0);
CREATE AGGREGATE statement
``````````````````````````
The ``CREATE AGGREGATE`` command with the optional ``OR REPLACE`` keywords creates either an aggregate or replaces an existing one with the same signature. A ``CREATE AGGREGATE`` without ``OR REPLACE`` fails if an aggregate with the same signature already exists. The ``CREATE AGGREGATE`` command with the optional ``IF NOT EXISTS`` keywords creates an aggregate if it does not already exist. The ``OR REPLACE`` and ``IF NOT EXISTS`` phrases cannot be used together.
The ``STYPE`` value defines the type of the state value and must be specified. The optional ``INITCOND`` defines the initial state value for the aggregate; the default value is null. A non-null ``INITCOND`` must be specified for state functions that are declared with ``RETURNS NULL ON NULL INPUT``.
The ``SFUNC`` value references an existing function to use as the state-modifying function. The first argument of the state function must have type ``STYPE``. The remaining arguments of the state function must match the types of the user-defined aggregate arguments. The state function is called once for each row, and the value returned by the state function becomes the new state. State is not updated for state functions declared with ``RETURNS NULL ON NULL INPUT`` and called with null. After all rows are processed, the optional ``FINALFUNC`` is executed with the last state value as its argument. It must take only one argument with type ``STYPE``, but the return type of the ``FINALFUNC`` may be a different type. A final function declared with ``RETURNS NULL ON NULL INPUT`` means that the aggregate’s return value will be null if the last state is null.
If no ``FINALFUNC`` is defined, the overall return type of the aggregate function is ``STYPE``. If a ``FINALFUNC`` is defined, it is the return type of that function.
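The evaluation order described above amounts to a fold over the selected rows. The sketch below models it under the assumption that both the state function and the final function are declared with ``RETURNS NULL ON NULL INPUT``; ``run_aggregate`` is a hypothetical helper, and the two Python functions mirror the Lua bodies of the ``avg_length`` example rather than reproduce ScyllaDB's execution.

```python
def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Model of a user-defined aggregate: INITCOND, SFUNC per row, then FINALFUNC."""
    state = initcond
    for args in rows:
        # RETURNS NULL ON NULL INPUT: the state function is not executed and
        # the state is not updated when any argument (or the state) is null.
        if state is None or any(a is None for a in args):
            continue
        state = sfunc(state, *args)
    if finalfunc is None:
        return state  # without FINALFUNC, the result has type STYPE
    if state is None:
        return None   # FINALFUNC declared RETURNS NULL ON NULL INPUT
    return finalfunc(state)


def accumulate_len(acc, a):
    # Same logic as the Lua state function: count rows and sum string lengths.
    return (acc[0] + 1, acc[1] + len(a))


def present(res):
    return f"The average string length is {res[1] / res[0]}!"


rows = [("ab",), ("abcd",), (None,)]
print(run_aggregate(rows, accumulate_len, (0, 0), present))
```

The null row is skipped, so the final state counts two rows with a total length of six. Dropping ``present`` from the call returns the raw state tuple, which corresponds to an aggregate defined without ``FINALFUNC``.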
DROP AGGREGATE statement
````````````````````````
Dropping a user-defined aggregate function uses the ``DROP AGGREGATE`` statement. For example::
DROP AGGREGATE myAggregate;
DROP AGGREGATE myKeyspace.anAggregate;
DROP AGGREGATE someAggregate ( int );
DROP AGGREGATE someAggregate ( text );
The ``DROP AGGREGATE`` statement removes an aggregate created using ``CREATE AGGREGATE``. You must specify the argument types of the aggregate to drop if there are multiple overloaded aggregates with the same name but a different signature.
The ``DROP AGGREGATE`` command with the optional ``IF EXISTS`` keywords drops an aggregate if it exists, and does nothing if an aggregate with that signature does not exist.
*`Get Started Lesson on Scylla University <https://university.scylladb.com/courses/scylla-essentials-overview/lessons/quick-wins-install-and-run-scylla/>`_
*`Get Started Lesson on ScyllaDB University <https://university.scylladb.com/courses/scylla-essentials-overview/lessons/quick-wins-install-and-run-scylla/>`_
*:doc:`CQL Reference </cql/index>`
*:doc:`cqlsh - the CQL shell </cql/cqlsh/>`
..panel-box::
:title:Use Scylla with Third-party Solutions
:title:Use ScyllaDB with Third-party Solutions
:id:"getting-started"
:class:my-panel
*:doc:`Migrate to Scylla </using-scylla/migrate-scylla>` - How to migrate your current database to Scylla
*:doc:`Integrate with Scylla </using-scylla/integrations/index>` - Integration solutions with Scylla
*:doc:`Migrate to ScyllaDB </using-scylla/migrate-scylla>` - How to migrate your current database to Scylla
*:doc:`Integrate with ScyllaDB </using-scylla/integrations/index>` - Integration solutions with Scylla
ScyllaDB Web Installer is a platform-agnostic installation script you can run with ``curl`` to install ScyllaDB on Linux.
See `ScyllaDB Download Center <https://www.scylladb.com/download/#server>`_ for information on manually installing ScyllaDB with platform-specific installation packages.
See `ScyllaDB Download Center <https://www.scylladb.com/download/#core>`_ for information on manually installing ScyllaDB with platform-specific installation packages.
The following matrix shows which Operating Systems, Platforms, and Containers / Instance Engines are supported with which versions of Scylla.
The following matrix shows which Operating Systems, Platforms, and Containers / Instance Engines are supported with which versions of ScyllaDB.
Scylla requires a fix to the XFS append introduced in kernel 3.15 (back-ported to 3.10 in RHEL/CentOS).
Scylla will not run with earlier kernel versions. Details in `Scylla issue 885 <https://github.com/scylladb/scylla/issues/885>`_.
ScyllaDB requires a fix to the XFS append introduced in kernel 3.15 (back-ported to 3.10 in RHEL/CentOS).
ScyllaDB will not run with earlier kernel versions. Details in `ScyllaDB issue 885 <https://github.com/scylladb/scylla/issues/885>`_.
.. REMOVE IN FUTURE VERSIONS - Remove information about versions from the notes below in version 5.2.
..note::
**Supported Architecture**
Scylla Open Source supports x86_64 for all versions and aarch64 starting from Scylla 4.6 and nightly build. In particular, aarch64 support includes AWS EC2 Graviton.
For Scylla Open Source **4.5** and later, the recommended OS and Scylla AMI/IMage OS is Ubuntu 20.04.4 LTS.
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly build. In particular, aarch64 support includes AWS EC2 Graviton.
Scylla Open Source
-------------------
ScyllaDB Open Source
----------------------
..note::For Enterprise versions **prior to** 4.6, the recommended OS and Scylla AMI/Image OS is CentOS 7.
..note::
For Scylla Open Source versions **4.6 and later**, the recommended OS and Scylla AMI/Image OS is Ubuntu 20.04.
Recommended OS and ScyllaDB AMI/Image OS for ScyllaDB Open Source:
@@ -10,14 +10,21 @@ Trying to count all rows in a table using
SELECT COUNT(1) FROM ks.table;
often fails with **ReadTimeout** error.
may fail with the **ReadTimeout** error.
COUNT() is running a full-scan query on all nodes, which might take a long time to finish. Often the time is greater than Scylla query timeout.
One way to bypass this in Scylla 4.4 or later is increasing the timeout for this query using the :ref:`USING TIMEOUT <using-timeout>` directive, for example:
COUNT() runs a full-scan query on all nodes, which might take a long time to finish. As a result, the count time may be greater than the ScyllaDB query timeout.
One way to prevent that issue in Scylla 4.4 or later is to increase the timeout for the query using the :ref:`USING TIMEOUT <using-timeout>` directive, for example:
..code-block::cql
SELECT COUNT(1) FROM ks.table USING TIMEOUT 120s;
You can also get an *estimation* of the number **of partitions** (not rows) with :doc:`nodetool tablestats </operating-scylla/nodetool-commands/tablestats>`
You can also get an *estimation* of the number **of partitions** (not rows) with :doc:`nodetool tablestats </operating-scylla/nodetool-commands/tablestats>`.
..note::
ScyllaDB 5.1 includes improvements to speed up the execution of SELECT COUNT(*) queries.
To increase the count speed, we recommend upgrading to ScyllaDB 5.1 or later.
.. REMOVE IN FUTURE VERSIONS - Remove the note above in version 5.1.
* :doc:`Map CPUs to Scylla Shards </kb/map-cpu>` - Mapping between CPUs and Scylla shards
* :doc:`Recreate RAID devices </kb/raid-device>` - How to recreate your RAID devices without running scylla-setup
* :doc:`Configure Scylla Networking with Multiple NIC/IP Combinations </kb/yaml-address>` - examples for setting the different IP addresses in scylla.yaml
* :doc:`Updating the Mode in perftune.yaml After a ScyllaDB Upgrade </kb/perftune-modes-sync>`
In versions 5.1 (ScyllaDB Open Source) and 2022.2 (ScyllaDB Enterprise), we improved ScyllaDB's performance by `removing the rx_queues_count from the mode
condition <https://github.com/scylladb/seastar/pull/949>`_. As a result, ScyllaDB operates in
the ``sq_split`` mode instead of the ``mq`` mode (see :doc:`Seastar Perftune </operating-scylla/admin-tools/perftune>` for information about the modes).
If you upgrade from an earlier version of ScyllaDB, your cluster's existing nodes may use the ``mq`` mode,
while new nodes will use the ``sq_split`` mode. As using different modes across one cluster is not recommended,
you should change the configuration to ensure that the ``sq_split`` mode is used on all nodes.
This section describes how to update the `perftune.yaml` file to configure the ``sq_split`` mode on all nodes.
Procedure
------------
The examples below assume that you are using the default locations for storing data and the `scylla.yaml` file,
A new ``/etc/scylla.d/cpuset.conf`` will be generated on the output.
#. Compare the contents of the newly generated ``/etc/scylla.d/cpuset.conf`` with ``/etc/scylla.d/cpuset.conf.old`` you created in step 1.
- If they are exactly the same, rename ``/etc/scylla.d/perftune.yaml.old`` you created in step 1 back to ``/etc/scylla.d/perftune.yaml`` and continue to the next node.
- If they are different, move on to the next steps.
#. Restart the ``scylla-server`` service.
.. code-block:: console
nodetool drain
sudo systemctl restart scylla-server
#. Wait for the service to be up and running (similarly to how it is done during a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart>`). It may take a considerable amount of time before the node is in the UN state due to resharding.
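The comparison step in the procedure above (new ``cpuset.conf`` against ``cpuset.conf.old``) can be scripted when many nodes need to be checked. The following is a minimal sketch that assumes the default ``/etc/scylla.d`` location and the ``.old`` file names used in the steps; the function name is hypothetical, and the script does not restart any service.

```python
import filecmp
import shutil


def restore_perftune_if_cpuset_unchanged(conf_dir: str = "/etc/scylla.d") -> bool:
    """Return True if cpuset.conf is unchanged and perftune.yaml.old was restored."""
    new_cpuset = f"{conf_dir}/cpuset.conf"
    old_cpuset = f"{conf_dir}/cpuset.conf.old"
    # shallow=False forces a byte-by-byte comparison instead of comparing
    # only file size and modification time.
    if filecmp.cmp(new_cpuset, old_cpuset, shallow=False):
        shutil.move(f"{conf_dir}/perftune.yaml.old", f"{conf_dir}/perftune.yaml")
        return True
    return False
```

A return value of ``True`` means the node needs no further changes and you can continue to the next node; ``False`` means the files differ and the remaining steps (restarting ``scylla-server``) apply. Run it with sufficient privileges to modify files under ``/etc/scylla.d``.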
* :doc:`Tracing </using-scylla/tracing>` - a ScyllaDB tool for debugging and analyzing internal flows in the server.
* :doc:`SSTableloader </operating-scylla/admin-tools/sstableloader>` - Bulk load the sstables found in the directory to a Scylla cluster
* :doc:`scylla-sstable </operating-scylla/admin-tools/scylla-sstable>` - Validates and dumps the content of SStables, generates a histogram, dumps the content of the SStable index.
* :doc:`scylla-types </operating-scylla/admin-tools/scylla-types/>` - Examines raw values obtained from SStables, logs, coredumps, etc.
* :doc:`Scylla SStable </operating-scylla/admin-tools/scylla-sstable>` - Validates and dumps the content of SStables, generates a histogram, dumps the content of the SStable index.
* :doc:`Scylla Types </operating-scylla/admin-tools/scylla-types/>` - Examines raw values obtained from SStables, logs, coredumps, etc.
* :doc:`cassandra-stress </operating-scylla/admin-tools/cassandra-stress/>` - A tool for benchmarking and load testing Scylla and Cassandra clusters.
This tool allows you to examine the content of SStables by performing operations such as dumping the content of SStables,
generating a histogram, validating the content of SStables, and more. See `Supported Operations`_ for the list of available operations.
Run ``scylla-sstable --help`` for additional information about the tool and the operations.
Run ``scylla sstable --help`` for additional information about the tool and the operations.
This tool is similar to SStableDump_, with notable differences:
* Built on the ScyllaDB C++ codebase, it supports all SStable formats and components that ScyllaDB supports.
* Expanded scope: this tool supports much more than dumping SStable data components (see `Supported Operations`_).
* More flexible on how schema is obtained and where SStables are located: SStableDump_ only supports dumping SStables located in their native data directory. To dump an SStable, one has to clone the entire ScyllaDB data directory tree, including system table directories and even config files. ``scylla sstable`` can dump sstables from any path with multiple choices on how to obtain the schema, see Schema_.
Currently, SStableDump_ works better on production systems as it automatically loads the schema from the system tables, unlike ``scylla sstable``, which has to be provided with the schema explicitly. On the other hand, ``scylla sstable`` works better for off-line investigations, as it can be used with as little as just a schema definition file and a single sstable. In the future, we plan on closing this gap by adding support for automatic schema loading to ``scylla sstable`` too, and completely supplanting SStableDump_ with ``scylla sstable``.
@@ -21,11 +31,82 @@ The command syntax is as follows:
.. code-block:: console

   scylla-sstable <operation> <path to SStable>
   scylla sstable <operation> <path to SStable>
You can specify more than one SStable.
Schema
^^^^^^
All operations need a schema to interpret the SStables with.
Currently, there are two ways to obtain the schema:
* ``--schema-file FILENAME`` - Read the schema definition from a file.
* ``--system-schema KEYSPACE.TABLE`` - Use the known definition of built-in tables (only works for system tables).
By default, the tool uses the first method: ``--schema-file schema.cql``; i.e. it assumes there is a schema file named ``schema.cql`` in the working directory.
If this fails, it will exit with an error.
The schema file should contain all definitions needed to interpret data belonging to the table.
* In addition to the table itself, the definition also has to include any user-defined types the table uses.
* The keyspace definition is optional; if it is missing, one will be auto-generated.
* The schema file doesn't have to be called ``schema.cql``; this is just the default name. Any file name is supported (with any extension).
Dropped columns
***************
The examined sstable might have columns which were dropped from the schema definition. In this case, providing the up-to-date schema will not be enough: the tool will fail when attempting to process a cell for the dropped column.
Dropped columns can be provided to the tool in the form of insert statements into the ``system_schema.dropped_columns`` system table, in the schema definition file. Example:
.. code-block:: cql

    INSERT INTO system_schema.dropped_columns (
        keyspace_name,
        table_name,
        column_name,
        dropped_time,
        type
    ) VALUES (
        'ks',
        'cf',
        'v1',
        1631011979170675,
        'int'
    );

    CREATE TABLE ks.cf (pk int PRIMARY KEY, v2 int);
System tables
*************
If the examined table is a system table -- it belongs to one of the system keyspaces (``system``, ``system_schema``, ``system_distributed`` or ``system_distributed_everywhere``) -- you can just tell the tool to use the known built-in definition of said table. This is possible with the ``--system-schema`` flag. Example:
The Load and Stream feature extends nodetool refresh. The new ``-las`` option loads arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from the disk, calculates which nodes own the data, and streams the data to those nodes automatically.
For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process.
Load and Stream makes restores and migrations much easier:
* You can place sstables from any node on any node
* No need to run nodetool cleanup to remove unused data
Note that all the nodes in the cluster participate in the ``removenode`` operation to sync data if needed. For this reason, the operation will fail if one or more nodes in the cluster are not available.
In such a case, to ensure that the operation succeeds, you must explicitly specify a list of unavailable nodes with the ``--ignore-dead-nodes`` option.
@@ -41,14 +41,6 @@ Scylla nodetool repair command supports the following options:
nodetool repair -et 90874935784
nodetool repair --end-token 90874935784
- ``-seq``, ``--sequential`` Use *-seq* to carry out a sequential repair.
For example, a sequential repair of all keyspaces on a node:
::
nodetool repair -seq
- ``-hosts`` ``--in-hosts`` syncs the **repair master** data subset only between a list of nodes, using host ID or Address. The list *must* include the **repair master**.
Some files were not shown because too many files have changed in this diff