Reproduces #23284
Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.
This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.
Fixes#13381
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Copy `auth_test.py` from scylla-dtest test suite, remove all not next_gating tests from it, and make it works with `test.py`
As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker.
Remove `test_permissions_caching` test because it's too flaky when running using test.py
Also, make few time execution optimizations:
- remove redundant `time.sleep(10)`
- use smaller timeouts for CQL sessions
Enable the test in `suite.yaml` (run in dev mode only.)
Additional modifications to test.py/dtest shim code:
- Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair.
- Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`.
- Copy generate_cluster_topology() function from tools/cluster_topology.py module.
- Add support for `bootstrap` parameter for `new_node()` function.
- Rework `wait_for_any_log()` function.
Closesscylladb/scylladb#24648
* github.com:scylladb/scylladb:
test.py: dtest: make auth_test.py run using test.py
test.py: dtest: rework wait_for_any_log()
test.py: dtest: add support for bootstrap parameter for new_node
test.py: dtest: add generate_cluster_topology() function
test.py: dtest: add ScyllaNode.set_configuration_options() method
test.py: pylib/manager_client: support batch config changes
test.py: dtest: copy unmodified auth_test.py
test.py: dtest: add missed markers to pytest.ini
Multiple tests are currently flaky due to graceful shutdown
timing out when flushing tables takes more than a minute. We still
don't understand why flushing is sometimes so slow, but we suspect
it is an issue with new machines spider9 and spider11 that CI runs
on. All observed failures happened on these machines, and most of
them on spider9.
In this commit, we increase the timeout of graceful shutdown as
a temporary workaround to improve CI stability. When we get to
the bottom of the issue and fix it, we will revert this change.
Ref #12028
It's a temporary workaround to improve CI stability, we don't
have to backport it.
Closesscylladb/scylladb#24802
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle
No need to backport anything since this class in not used yet
Closesscylladb/scylladb#24657
* github.com:scylladb/scylladb:
s3_test: Add s3_client test for non-retryable error handling
s3_test: Add trace logging for default_retry_strategy
s3_client: Fix edge case when the range is exhausted
s3_client: Fix indentation in try..catch block
s3_client: Stop retries in chunked download source
s3_client: Enhance test coverage for retry logic
s3_client: Add test for Content-Range fix
s3_client: Fix missing negation
s3_client: Refine logging
s3_client: Improve logging placement for current_range output
This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except `counter` type as that is not comparable. The byte-comparable format is a pre-requisite for implementing the trie based index format for our sstables(https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md
Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR.
Refs https://github.com/scylladb/scylladb/issues/19407
New feature - backport not required.
Closesscylladb/scylladb#23541
* github.com:scylladb/scylladb:
types/comparable_bytes: add testcase to verify compatibility with cassandra
types/comparable_bytes: support variable-length natively byte-ordered data types
types/comparable_bytes: support decimal cql3 types
types/comparable_bytes: introduce count_digits() method
types/comparable_bytes: support uuid and timeuuid cql3 types
types/comparable_bytes: support varint cql3 type
types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
types/comparable_bytes: introduce encode/decode_varint_length
types/comparable_bytes: support float and double cql3 types
types/comparable_bytes: support date, time and timestamp cql3 types
types/comparable_bytes: support bigint cql3 type
types/comparable_bytes: support fixed length signed integers
types/comparable_bytes: support boolean cql3 type
types: introduce comparable_bytes class
bytes_ostream: overload write() to support writing from FragmentedView
docs: fix minor typo in docs/dev/cql3-type-mapping.md
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.
Fixesscylladb/scylladb#24018
Backport: all of the supported versions are affected, so we want to backport the changes there.
Closesscylladb/scylladb#24151
* github.com:scylladb/scylladb:
cql3/description: Serialize only rvalues of description
cql3: Represent create_statement using managed_string
cql3/statements/describe_statement.cc: Don't copy descriptions
cql3: Use managed_bytes instead of bytes in DESCRIBE
utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
The following cql3 data types - ascii, blob, duration, inet, and text -
are natively byte-ordered in their serialized forms. To encode them into
a byte-comparable format, zeros are escaped, and since these types have
variable lengths, the encoded form is terminated in an escaped state to
mark its end.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The decimal cql3 type is internally stored as a scale and an unscaled
integer. To convert them into a byte comparable format, they are first
normalized into a base-100 exponent and a mantissa that lies in [0.01, 1)
and then encoded into a byte sequence that preserves the numerical order.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implemented a method `count_digits()` to return the number of significant
digits in a given boost::multiprecision:cpp_int. This is required to
convert big_decimal to a byte comparable format.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The uuid type values are composed of two fixed-length unsigned integers:
an msb and an lsb. The msb contains a version digit, which must be
pulled first in a byte-comparable representation. For version 1 uuids,
in addition to extracting the version digit first, the msb must be
rearranged to make it byte comparable. The lsb is written as is.
For the timeuuid type, the msb is handled simliar to the version 1 uuid
values. The lsb however is treated differently - the sign bits of all
bytes are inverted to preserve the legacy comparison order, which
compared individual bytes as signed values.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Any varint value less than 7 bytes is encoded using the signed long
encoding format and remaining values are all encoded using the full form
encoding :
<signbyte><length as unsigned integer - 7><7 or more bytes>,
where <signbyte> is 00 for negative numbers and FF for positive ones,
and the length's bytes are inverted if the number is negative (so that
longer length sorts smaller).
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The length of a varint value is encoded separately as an unsigned
variable-length integer. For negative varint values, the encoded bytes
are flipped to ensure that longer lengths sort smaller. This patch
implements both encoding and decoding logic for varint lengths and will
be used by the subsequent patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The sign bit is flipped for positive values to ensure that they are
ordered after negative values. For negative values, all the bytes are
inverted, allowing larger negative values to be ordered before smaller
ones.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Both the date and time cql3 types are internally unsigned fixed length
integers. Their serialized form is already byte comparable, so the
encoder and decoder return the serialized bytes as it is.
The timestamp type is encoded using the fixed length signed integer
encoding.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The bigint type, internally implemented as a long data type, is encoded
using a variable-length encoding similar to UTF-8. This enables a
significant amount of space to be saved when smaller numbers are
frequently used, while still permitting large values to be efficiently
encoded.
The first bit of the encoding represents the inverted sign (i.e., 1 for
positive, 0 for negative), followed by length encoded as a sequence of
bits matching the inverted sign. This is then followed by a differing
bit (except for 9-byte encodings) and the bits of the number's two's
complement.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
To encode fixed-length signed integers in a byte-comparable format, the
first bit of each value is inverted. This ensures that negative numbers
are ordered before positive ones during comparison. This patch adds
support for the data types : byte_type (tinyint), short_type (smallint),
and int32_type (int). Although long_type (bigint) is a fixed length
integer type, it has different byte comparable encoding and will be
handled separately in another patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This patch implements a new class, `comparable_bytes`, designed to
implement methods for converting data values to and from byte-comparable
formats. The class stores the comparable bytes as `managed_bytes` and
currently provides the structure for all required methods. The actual
logic for converting various data types will be implemented in subsequent
patches.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Overloaded write() method to support writing a FragmentedView into
bytes_ostream. Also added a testcase to verify the implementation.
The new helper will be used by the byte_comparable implementation
during the encode/decode process.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.
This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.
Extend the S3 proxy to support error injection when the client
makes multiple requests to the same resource—useful for testing
retry behavior and failure handling.
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.
main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.
Fixes https://github.com/scylladb/scylladb/issues/17043
backport is not needed. this is preliminary work for support of MVs and CDC with tablets.
Closesscylladb/scylladb#22906
* github.com:scylladb/scylladb:
tablets: validate no clustering row mutations on co-located tables
raft_group0_client: extend validate_change to mixed_change type
docs: topology-over-raft: document co-located tables
tablet-mon.py: visual indication for co-located tablets
tablet-mon.py: handle co-located tablets
test/boost/view_schema_test.cc: fix race in wait_until_built
boost/tablets_test: test load balancing and resize of co-located tablets
test/tablets: test tablets colocation
tablets: co-locate view tablets with base when the partition keys match
test/pylib/tablets: common get_tablet_count api
test_mv_tablets: use get_tablet_replicas from common tablets api
test/pylib/tablets: fix test api to read tablet replicas from base table
tablets: allocator: create co-located tables in a single operation
alternator: prepare all new tables in a single announcement
migration_manager: add notification for creating multiple tables
tablets: read_tablet_transition_stage: read from base table
storage service: allow repair request only on base tables
tablets: keyspace_rf_change: apply on base table
storage service: generate tablet migration updates on base tables
tablets: replace all_tables method
tablets: split when all co-located tablets are ready
tablets: load balancer: sizing plan for table groups
tablets: load balancer: handle co-located tablets
tablets: allocate co-located tablets
tablets: handle migration of co-located tablets
storage service: add repair colocated tablets rpc
tablets: save and read tablet metadata of co-located tables
tablets: represent co-located tables in tablet metadata
tablets: add base_table column to system.tablets
docs: update system.tablets schema
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.
Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;
then the repair sees a topology state different than the actual -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results with
on_internal_error.
Start repair when topology is not busy. The check isn't atomic,
as it's done on a shard 0. Thus, we compare the topology versions
to ensure that the business check is valid.
Fixes: https://github.com/scylladb/scylladb/issues/24195.
Needs backport to all branches since they are affected
Closesscylladb/scylladb#24202
* github.com:scylladb/scylladb:
test: add test for repair and resize finalization
repair: postpone repair until topology is not busy
The test has two major problems
1. Wrongly computed time windows. Data was not spread across two 1-minute
windows causing the test to generate even three sstables instead
of two
2. Timestamp was not propagated to the prepared CQL statements. So
in fact, a current time was used implicitly
3. Because of the incorrect timestamp issue, the remaining tests
testing purged tombstones were affected as well.
Fixes https://github.com/scylladb/scylladb/issues/24532Closesscylladb/scylladb#24609
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.
Fixesscylladb/scylladb#24018
Replace the duplicated get_tablet_replicas method in test_mv_tablets
with the common method from the tablets api, to reduce code duplication
and use the correct method that reads the tablet replicas from the base
table.
When reading tablet replicas from system.tablets, we need to refer to
the base table partition, if any.
We fix and simplify the test api for reading tablet replicas to read
from the base table.
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.
Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, doesn't matter if they
are part of a co-located group and have a base table.
We replace all_tables with new methods that can be used for each of the
cases.
Fixes#24574
* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.
Closesscylladb/scylladb#24633
* github.com:scylladb/scylladb:
encryption_at_rest_test: Add exception handler to ensure proxy stop
encryption: Ensure stopping timers in provider cache objects
Move 3rd party services starting under `try` clause to avoid situation that main process is collapses without going stopping services.
Without this, if something wrong during start it will not trigger execution exit artifacts, so the process will stay forever.
This functionality in 2025.2 and can potentially affect jobs, so backport needed.
Closesscylladb/scylladb#24734
* github.com:scylladb/scylladb:
test.py: use unique hostname for Minio
test.py: Catch possible exceptions during 3rd party services start
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.
As a part of the porting process, remove unused imports and
markers, remove non-next_gating tests and tests marked with
`required_features("!consistent-topology-changes")` marker.
Remove `test_permissions_caching` test because it's too
flaky when running using test.py
Also, make few time execution optimizations:
- remove redundant `time.sleep(10)`
- use smaller timeouts for CQL sessions
Enable the test in suite.yaml (run in dev mode only)
Make `wait_for_any_log()` function to work closer to the original
dtest's version: use `ScyllaLogFile.grep()` method instead of
the usage of `ScyllaNode.wait_log_for()` with a small timeout to
have at least one try to find.
Also, add `max_count` argument to `.grep()` method for the
optimization purpose.
Technically, `new_node()`'s `bootstrap` parameter used to mark a node
as a seed if it's False. In test.py, seeds parameter passed on start of
a node, so, save it as `ScyllaNode.bootstrap` attribute to use in
`ScyllNode.start()` method.
Modify ManagerClient.server_update_config() method to change
multiple config options in one call in addition to one `key: value`
pair. All internal machinery converted to get a values dict as a
parameter. Type hints were adjusted too.