- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle
No need to backport anything since this class in not used yet
Closesscylladb/scylladb#24657
* github.com:scylladb/scylladb:
s3_test: Add s3_client test for non-retryable error handling
s3_test: Add trace logging for default_retry_strategy
s3_client: Fix edge case when the range is exhausted
s3_client: Fix indentation in try..catch block
s3_client: Stop retries in chunked download source
s3_client: Enhance test coverage for retry logic
s3_client: Add test for Content-Range fix
s3_client: Fix missing negation
s3_client: Refine logging
s3_client: Improve logging placement for current_range output
This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except `counter` type as that is not comparable. The byte-comparable format is a pre-requisite for implementing the trie based index format for our sstables(https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md
Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR.
Refs https://github.com/scylladb/scylladb/issues/19407
New feature - backport not required.
Closesscylladb/scylladb#23541
* github.com:scylladb/scylladb:
types/comparable_bytes: add testcase to verify compatibility with cassandra
types/comparable_bytes: support variable-length natively byte-ordered data types
types/comparable_bytes: support decimal cql3 types
types/comparable_bytes: introduce count_digits() method
types/comparable_bytes: support uuid and timeuuid cql3 types
types/comparable_bytes: support varint cql3 type
types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
types/comparable_bytes: introduce encode/decode_varint_length
types/comparable_bytes: support float and double cql3 types
types/comparable_bytes: support date, time and timestamp cql3 types
types/comparable_bytes: support bigint cql3 type
types/comparable_bytes: support fixed length signed integers
types/comparable_bytes: support boolean cql3 type
types: introduce comparable_bytes class
bytes_ostream: overload write() to support writing from FragmentedView
docs: fix minor typo in docs/dev/cql3-type-mapping.md
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.
Fixesscylladb/scylladb#24018
Backport: all of the supported versions are affected, so we want to backport the changes there.
Closesscylladb/scylladb#24151
* github.com:scylladb/scylladb:
cql3/description: Serialize only rvalues of description
cql3: Represent create_statement using managed_string
cql3/statements/describe_statement.cc: Don't copy descriptions
cql3: Use managed_bytes instead of bytes in DESCRIBE
utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
The following cql3 data types - ascii, blob, duration, inet, and text -
are natively byte-ordered in their serialized forms. To encode them into
a byte-comparable format, zeros are escaped, and since these types have
variable lengths, the encoded form is terminated in an escaped state to
mark its end.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The decimal cql3 type is internally stored as a scale and an unscaled
integer. To convert them into a byte comparable format, they are first
normalized into a base-100 exponent and a mantissa that lies in [0.01, 1)
and then encoded into a byte sequence that preserves the numerical order.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implemented a method `count_digits()` to return the number of significant
digits in a given boost::multiprecision:cpp_int. This is required to
convert big_decimal to a byte comparable format.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The uuid type values are composed of two fixed-length unsigned integers:
an msb and an lsb. The msb contains a version digit, which must be
pulled first in a byte-comparable representation. For version 1 uuids,
in addition to extracting the version digit first, the msb must be
rearranged to make it byte comparable. The lsb is written as is.
For the timeuuid type, the msb is handled simliar to the version 1 uuid
values. The lsb however is treated differently - the sign bits of all
bytes are inverted to preserve the legacy comparison order, which
compared individual bytes as signed values.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Any varint value less than 7 bytes is encoded using the signed long
encoding format and remaining values are all encoded using the full form
encoding :
<signbyte><length as unsigned integer - 7><7 or more bytes>,
where <signbyte> is 00 for negative numbers and FF for positive ones,
and the length's bytes are inverted if the number is negative (so that
longer length sorts smaller).
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The length of a varint value is encoded separately as an unsigned
variable-length integer. For negative varint values, the encoded bytes
are flipped to ensure that longer lengths sort smaller. This patch
implements both encoding and decoding logic for varint lengths and will
be used by the subsequent patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The sign bit is flipped for positive values to ensure that they are
ordered after negative values. For negative values, all the bytes are
inverted, allowing larger negative values to be ordered before smaller
ones.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Both the date and time cql3 types are internally unsigned fixed length
integers. Their serialized form is already byte comparable, so the
encoder and decoder return the serialized bytes as it is.
The timestamp type is encoded using the fixed length signed integer
encoding.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The bigint type, internally implemented as a long data type, is encoded
using a variable-length encoding similar to UTF-8. This enables a
significant amount of space to be saved when smaller numbers are
frequently used, while still permitting large values to be efficiently
encoded.
The first bit of the encoding represents the inverted sign (i.e., 1 for
positive, 0 for negative), followed by length encoded as a sequence of
bits matching the inverted sign. This is then followed by a differing
bit (except for 9-byte encodings) and the bits of the number's two's
complement.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
To encode fixed-length signed integers in a byte-comparable format, the
first bit of each value is inverted. This ensures that negative numbers
are ordered before positive ones during comparison. This patch adds
support for the data types : byte_type (tinyint), short_type (smallint),
and int32_type (int). Although long_type (bigint) is a fixed length
integer type, it has different byte comparable encoding and will be
handled separately in another patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This patch implements a new class, `comparable_bytes`, designed to
implement methods for converting data values to and from byte-comparable
formats. The class stores the comparable bytes as `managed_bytes` and
currently provides the structure for all required methods. The actual
logic for converting various data types will be implemented in subsequent
patches.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Overloaded write() method to support writing a FragmentedView into
bytes_ostream. Also added a testcase to verify the implementation.
The new helper will be used by the byte_comparable implementation
during the encode/decode process.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.
This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.
Extend the S3 proxy to support error injection when the client
makes multiple requests to the same resource—useful for testing
retry behavior and failure handling.
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.
main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.
Fixes https://github.com/scylladb/scylladb/issues/17043
backport is not needed. this is preliminary work for support of MVs and CDC with tablets.
Closesscylladb/scylladb#22906
* github.com:scylladb/scylladb:
tablets: validate no clustering row mutations on co-located tables
raft_group0_client: extend validate_change to mixed_change type
docs: topology-over-raft: document co-located tables
tablet-mon.py: visual indication for co-located tablets
tablet-mon.py: handle co-located tablets
test/boost/view_schema_test.cc: fix race in wait_until_built
boost/tablets_test: test load balancing and resize of co-located tablets
test/tablets: test tablets colocation
tablets: co-locate view tablets with base when the partition keys match
test/pylib/tablets: common get_tablet_count api
test_mv_tablets: use get_tablet_replicas from common tablets api
test/pylib/tablets: fix test api to read tablet replicas from base table
tablets: allocator: create co-located tables in a single operation
alternator: prepare all new tables in a single announcement
migration_manager: add notification for creating multiple tables
tablets: read_tablet_transition_stage: read from base table
storage service: allow repair request only on base tables
tablets: keyspace_rf_change: apply on base table
storage service: generate tablet migration updates on base tables
tablets: replace all_tables method
tablets: split when all co-located tablets are ready
tablets: load balancer: sizing plan for table groups
tablets: load balancer: handle co-located tablets
tablets: allocate co-located tablets
tablets: handle migration of co-located tablets
storage service: add repair colocated tablets rpc
tablets: save and read tablet metadata of co-located tables
tablets: represent co-located tables in tablet metadata
tablets: add base_table column to system.tablets
docs: update system.tablets schema
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.
Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;
then the repair sees a topology state different than the actual -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results with
on_internal_error.
Start repair when topology is not busy. The check isn't atomic,
as it's done on a shard 0. Thus, we compare the topology versions
to ensure that the business check is valid.
Fixes: https://github.com/scylladb/scylladb/issues/24195.
Needs backport to all branches since they are affected
Closesscylladb/scylladb#24202
* github.com:scylladb/scylladb:
test: add test for repair and resize finalization
repair: postpone repair until topology is not busy
The test has two major problems
1. Wrongly computed time windows. Data was not spread across two 1-minute
windows causing the test to generate even three sstables instead
of two
2. Timestamp was not propagated to the prepared CQL statements. So
in fact, a current time was used implicitly
3. Because of the incorrect timestamp issue, the remaining tests
testing purged tombstones were affected as well.
Fixes https://github.com/scylladb/scylladb/issues/24532Closesscylladb/scylladb#24609
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.
Fixesscylladb/scylladb#24018
Replace the duplicated get_tablet_replicas method in test_mv_tablets
with the common method from the tablets api, to reduce code duplication
and use the correct method that reads the tablet replicas from the base
table.
When reading tablet replicas from system.tablets, we need to refer to
the base table partition, if any.
We fix and simplify the test api for reading tablet replicas to read
from the base table.
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.
Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, doesn't matter if they
are part of a co-located group and have a base table.
We replace all_tables with new methods that can be used for each of the
cases.
Fixes#24574
* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.
Closesscylladb/scylladb#24633
* github.com:scylladb/scylladb:
encryption_at_rest_test: Add exception handler to ensure proxy stop
encryption: Ensure stopping timers in provider cache objects
Move 3rd party services starting under `try` clause to avoid situation that main process is collapses without going stopping services.
Without this, if something wrong during start it will not trigger execution exit artifacts, so the process will stay forever.
This functionality in 2025.2 and can potentially affect jobs, so backport needed.
Closesscylladb/scylladb#24734
* github.com:scylladb/scylladb:
test.py: use unique hostname for Minio
test.py: Catch possible exceptions during 3rd party services start
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.
This patch adds tests reproducing issue #24581, where Scylla incorrectly
parsed "decimal"-type literals in CQL with very high exponents, near or
above the 32-bit limit.
For example, 1.1234e-2147483647 was incorrectly read as 1.1234E+2147483649,
while it should be (as we explain in comments in the test) an error.
The tests in this patch failed (in multiple checks) before #24581 was
fixed, and pass after it was fixed.
These tests all pass on Cassandra 3, confirming our understanding on the
limits of "decimal" to be correct. But they fail on Cassandra 4 and 5 due
to a regression https://issues.apache.org/jira/browse/CASSANDRA-20723
in Cassandra, that mistakenly limited "decimal" exponents to just 309.
Refs #24581
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#24646
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.
Fixes: https://github.com/scylladb/scylladb/issues/24637Closesscylladb/scylladb#24638
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations:
1. A table name must be up to 222 characters.
2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters.
3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters.
The first patch documents these existing limitations, improves their testing, and fixes a tiny bug found by one of the tests (where UpdateTable adding a GSI's limit testing is off by one).
The second patch unfortunately shows with a reproducer (issue #24598) this limit of 222 is problematic and we may need to lower it: If a user creates a table of length 222 and then enables Alternator streams, Scylla shuts down on an IO error. This will need to be fixed later, but at least this patch properly documents the existing behavior.
No need to backport this patch - it is a very minor improvement that it is unlikely users care about and there is no potential for harm.
Closesscylladb/scylladb#24597
* github.com:scylladb/scylladb:
test/alternator: reproducer for streams bug with long table name
alternator: improve, document and test table/index name lengths
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
Closesscylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.
Fixes: https://github.com/scylladb/scylladb/issues/24506
Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.
Closesscylladb/scylladb#24497
* github.com:scylladb/scylladb:
mutation: check key of inserted rows
compound: optimize is_full() for single-component types
The two tests in this patch reproduce issue #24598: When enabling
Alternator streams on an Alternator table with a very long name,
such as the maximum allowed name length 222, the result is an
I/O error and a Scylla shutdown.
The two tests are currently marked "skip", otherwise they would
crash the Scylla being tested.
Refs #24598
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255
characters each, Alternator currently has different (and lower)
limitations:
1. A table name must be up to 222 characters.
2. For a GSI, the sum of the table's and GSI's name length, plus 1,
must be up to 222 characters.
3. For an LSI, the sum of the table's and LSI's name length, plus 2,
must be up to 222 characters.
These specific limitations were never documented, so in this patch we
add this information to docs/alternator/compatibility.md.
Moreover, these limitations where only partially tested, so in this patch
we add testing for more cases that we forgot to check - such as length
of LSI names (only GSI were checked before this patch), or adding a
GSI to an existing table. It is important to check all these corner
cases because there is a risk that if we attempt to create a table
without checking its length, we can end up with an I/O error that brings
down Scylla.
In one case - UpdateTable adding a GSI to an existing table - the new
test exposed a trivial bug: Because UpdateTable wants to verify the new
GSI doesn't have the same name as an existing LSI, it mistakenly applied
the LSI's length name limit instead of the GSI's name length limit,
which is one byte less than it should be. So this patch fixes this
trivial bug as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Audit component defines `audit` logger which it uses only for `error` and `info` logs,
regarding `audit` module initialization and errors during audit log writing.
This change introduces `debug` level logs on the happy path of audit log writes.
Fixes: https://github.com/scylladb/scylladb/issues/23773
No backport needed - this is a small quality-of-life improvement.
Closesscylladb/scylladb#24658
* github.com:scylladb/scylladb:
audit: change audit test logger level to `debug`
audit: introduce debug level logs on happy path
Audit module tests should show the `debug` level messages.
This change makes audit_test.py `audit` module log level to `debug`.
Closesscylladb/scylladb#23773
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write a 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).
Fixes: #24330Closesscylladb/scylladb#24403
Add run ID for process output file to be not overwritten in the next case: first run failed, second passed. They are using the same name, so the second run will overwrite and delete the file. This will help to investigate in case of C++ test fails
Add attaching Scylla log files to allure report in case test failed. This is an alternative for link in JUnit report that exists in CI. That change will help to investigate the cluster tests fails. Example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).
Backport is not needed, this is only framework enhancements
Closesscylladb/scylladb#24677
* github.com:scylladb/scylladb:
test.py: Attach node logs in allure report in case of fail
test.py: Add run id to the boost output file